| Message ID | 20230823094757.gxvCEOBi@linutronix.de (mailing list archive) |
|---|---|
| State | RFC |
| Delegated to: | Netdev Maintainers |
| Series | [BUG] Possible unsafe page_pool usage in octeontx2 |
Hi Sebastian,

Thanks for the report.

On Wed, 23 Aug 2023 at 12:48, Sebastian Andrzej Siewior
<bigeasy@linutronix.de> wrote:
>
> Hi,
>
> I've been looking at the page_pool locking.

Apologies for any traumas we caused with that code :)

> page_pool_alloc_frag() -> page_pool_alloc_pages() ->
> __page_pool_get_cached():
>
> The core of the allocation is:
> | /* Caller MUST guarantee safe non-concurrent access, e.g. softirq */
> | if (likely(pool->alloc.count)) {
> |         /* Fast-path */
> |         page = pool->alloc.cache[--pool->alloc.count];
>
> The access to the `cache' array and the `count' variable is not locked.
> This is fine as long as there is only one consumer per pool. In my
> understanding the intention is to have one page_pool per NAPI callback
> to ensure this.
>
> The pool can be filled in the same context (within allocation if the
> pool is empty). There is also page_pool_recycle_in_cache() which fills
> the pool from within skb free, for instance:
> napi_consume_skb() -> skb_release_all() -> skb_release_data() ->
> napi_frag_unref() -> page_pool_return_skb_page().
>
> The last one has the following check here:
> | napi = READ_ONCE(pp->p.napi);
> | allow_direct = napi_safe && napi &&
> |                READ_ONCE(napi->list_owner) == smp_processor_id();
>
> This eventually ends in page_pool_recycle_in_cache() where it adds the
> page to the cache buffer if the check above is true (and BH is disabled).
>
> napi->list_owner is set once NAPI is scheduled and remains set until the
> poll callback has completed. It is safe to add items to the list because
> only one of the two can run on a single CPU, and their completion is
> ensured by having BH disabled the whole time.
>
> This breaks in octeontx2 where a worker is used to fill the buffer:
> otx2_pool_refill_task() -> otx2_alloc_rbuf() -> __otx2_alloc_rbuf() ->
> otx2_alloc_pool_buf() -> page_pool_alloc_frag().
>
> BH is disabled but the add of a page can still happen while the NAPI
> callback runs on a remote CPU, corrupting the index/array.
>
> API wise I would suggest to
>
> diff --git a/net/core/page_pool.c b/net/core/page_pool.c
> index 7ff80b80a6f9f..b50e219470a36 100644
> --- a/net/core/page_pool.c
> +++ b/net/core/page_pool.c
> @@ -612,7 +612,7 @@ __page_pool_put_page(struct page_pool *pool, struct page *page,
>  			page_pool_dma_sync_for_device(pool, page,
>  						      dma_sync_size);
>
> -		if (allow_direct && in_softirq() &&
> +		if (allow_direct && in_serving_softirq() &&
>  		    page_pool_recycle_in_cache(page, pool))
>  			return NULL;

FWIW we used to have that check. Commit 542bcea4be866b ("net: page_pool:
use in_softirq() instead") changed that, so maybe we should revert that
overall?

> because the intention (as I understand it) is to be invoked from within
> the NAPI callback (while softirq is served) and not if BH is just
> disabled due to a lock or so.
>
> It would also make sense to add a WARN_ON_ONCE(!in_serving_softirq()) to
> page_pool_alloc_pages() to spot usage outside of softirq. But this will
> trigger in every driver since the same function is used in the open
> callback to initially set up the HW.

What about adding a check in the cached allocation path in order to
skip the initial page allocation?

Thanks
/Ilias

> Sebastian
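The lock-less fast path at the heart of this thread can be modeled in a few lines of userspace C. This is only an illustrative sketch, not the kernel code: the field names `cache` and `count` mirror `pool->alloc.cache` / `pool->alloc.count` from the quoted snippet, but the struct layout and function names are invented here. It shows why the single-consumer rule matters: the pop is a plain, unsynchronized `cache[--count]`, so two contexts touching it concurrently corrupt both the index and the array.

```c
#include <assert.h>
#include <stddef.h>

#define PP_ALLOC_CACHE_SIZE 128

/* Model of the per-pool lock-less cache: a plain array plus a count,
 * with no locking whatsoever. */
struct pp_alloc_cache {
	unsigned int count;
	void *cache[PP_ALLOC_CACHE_SIZE];
};

/* Fast-path pop, as in __page_pool_get_cached(): safe only if a single
 * consumer runs at a time (the pool's NAPI poll callback). */
static void *cache_pop(struct pp_alloc_cache *c)
{
	if (c->count)
		return c->cache[--c->count];
	return NULL;
}

/* Fast-path push, as in page_pool_recycle_in_cache(): the same
 * single-context requirement applies on the producer side, which is
 * exactly what the otx2 refill worker violates. */
static int cache_push(struct pp_alloc_cache *c, void *page)
{
	if (c->count == PP_ALLOC_CACHE_SIZE)
		return 0;
	c->cache[c->count++] = page;
	return 1;
}
```

Within one context the LIFO behavior is what makes the cache fast (the most recently freed page, likely still cache-hot, is handed out first); the thread is about what happens when that one-context assumption breaks.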
> From: Sebastian Andrzej Siewior <bigeasy@linutronix.de> > Sent: Wednesday, August 23, 2023 3:18 PM > Subject: [EXT] [BUG] Possible unsafe page_pool usage in octeontx2 > > This breaks in octeontx2 where a worker is used to fill the buffer: > otx2_pool_refill_task() -> otx2_alloc_rbuf() -> __otx2_alloc_rbuf() -> > otx2_alloc_pool_buf() -> page_pool_alloc_frag(). > Thanks. I agree. This path is called in octeontx2 driver only if __otx2_alloc_rbuf() function returns error In below path. otx2_napi_handler() --> pfvf->hw_ops->refill_pool_ptrs() ---> cn10k_refill_pool_ptrs() --> otx2_alloc_buffer() --> __otx2_alloc_rbuf(). As I understand, the problem is due to workqueue may get scheduled on other CPU. If we use BOUND workqueue, do you think this problem can be solved ? > BH is disabled but the add of a page can still happen while NAPI callback runs > on a remote CPU and so corrupting the index/ array. > > API wise I would suggest to > > diff --git a/net/core/page_pool.c b/net/core/page_pool.c index > 7ff80b80a6f9f..b50e219470a36 100644 > --- a/net/core/page_pool.c > +++ b/net/core/page_pool.c > @@ -612,7 +612,7 @@ __page_pool_put_page(struct page_pool *pool, > struct page *page, > page_pool_dma_sync_for_device(pool, page, > dma_sync_size); > > - if (allow_direct && in_softirq() && > + if (allow_direct && in_serving_softirq() && > page_pool_recycle_in_cache(page, pool)) > return NULL; > > because the intention (as I understand it) is to be invoked from within the > NAPI callback (while softirq is served) and not if BH is just disabled due to a > lock or so. Could you help me understand where in_softirq() check will break ? If we TX a packet (dev_queue_xmit()) in Process context on same core, in_serving_softirq() check will prevent it from recycling ?
On 2023-08-23 12:28:58 [+0000], Ratheesh Kannoth wrote: > > From: Sebastian Andrzej Siewior <bigeasy@linutronix.de> > > Sent: Wednesday, August 23, 2023 3:18 PM > > Subject: [EXT] [BUG] Possible unsafe page_pool usage in octeontx2 > > > > This breaks in octeontx2 where a worker is used to fill the buffer: > > otx2_pool_refill_task() -> otx2_alloc_rbuf() -> __otx2_alloc_rbuf() -> > > otx2_alloc_pool_buf() -> page_pool_alloc_frag(). > > > As I understand, the problem is due to workqueue may get scheduled on > other CPU. If we use BOUND workqueue, do you think this problem can be > solved ? It would but is still open to less obvious races for instance if the IRQ/ NAPI is assigned to another CPU while the workqueue is scheduled. You would have to additional synchronisation to ensure that bad can happen. This does not make it any simpler nor prettier or serves as a good example. I would suggest to stay away from the lock-less buffer if not in NAPI and feed the pool->ring instead. > > BH is disabled but the add of a page can still happen while NAPI callback runs > > on a remote CPU and so corrupting the index/ array. > > > > API wise I would suggest to > > > > diff --git a/net/core/page_pool.c b/net/core/page_pool.c index > > 7ff80b80a6f9f..b50e219470a36 100644 > > --- a/net/core/page_pool.c > > +++ b/net/core/page_pool.c > > @@ -612,7 +612,7 @@ __page_pool_put_page(struct page_pool *pool, > > struct page *page, > > page_pool_dma_sync_for_device(pool, page, > > dma_sync_size); > > > > - if (allow_direct && in_softirq() && > > + if (allow_direct && in_serving_softirq() && > > page_pool_recycle_in_cache(page, pool)) > > return NULL; > > > > because the intention (as I understand it) is to be invoked from within the > > NAPI callback (while softirq is served) and not if BH is just disabled due to a > > lock or so. > Could you help me understand where in_softirq() check will break ? 
> If we TX a packet (dev_queue_xmit()) in process context on the same
> core, the in_serving_softirq() check will prevent it from recycling?

If a check is added to page_pool_alloc_pages() then it will trigger if
you fill the buffer from your ->ndo_open() callback, and also if you
invoke dev_queue_xmit() from process context. But then the page will be
added to &pool->ring instead.

Sebastian
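The distinction Sebastian keeps drawing — BH merely disabled versus a softirq handler actually running — is the difference between `in_softirq()` and `in_serving_softirq()`. A small userspace model of the two predicates may help; the flags below are illustrative (the kernel encodes both in the per-CPU preempt count, not in a struct like this):

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative per-"CPU" state; not the real preempt-count encoding. */
struct cpu_state {
	int bh_disable_depth;	/* raised by local_bh_disable() */
	bool serving_softirq;	/* true only inside a softirq handler */
};

/* in_softirq(): true whenever BH is disabled for any reason, including
 * a plain local_bh_disable() from process context, e.g. the otx2
 * refill worker. */
static bool model_in_softirq(const struct cpu_state *s)
{
	return s->bh_disable_depth > 0 || s->serving_softirq;
}

/* in_serving_softirq(): true only while a softirq handler (such as
 * NET_RX running the NAPI poll) is actually executing. */
static bool model_in_serving_softirq(const struct cpu_state *s)
{
	return s->serving_softirq;
}
```

This is why the otx2 worker slips through the current `in_softirq()` check: it wraps the allocation in `local_bh_disable()`, so the predicate is true even though no NAPI poll is running, and why the reverted `in_serving_softirq()` check would also reject legitimate process-context TX recycling, as Ratheesh asks above.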
On 2023-08-23 14:36:06 [+0300], Ilias Apalodimas wrote: > Hi Sebastian, Hi Ilias, > > dma_sync_size); > > > > - if (allow_direct && in_softirq() && > > + if (allow_direct && in_serving_softirq() && > > page_pool_recycle_in_cache(page, pool)) > > return NULL; > > > > FWIW we used to have that check. > commit 542bcea4be866b ("net: page_pool: use in_softirq() instead") > changed that, so maybe we should revert that overall? The other changes look okay, this in particular I am not sure about. It would end up in the pool->ring instead of the lock-less cache. It still depends how it got here. If it comes from page_pool_return_skb_page() then the list_owner check will fail because it is not set for the threaded-NAPI case. If it was a real concern, the entry point must have been page_pool_put_full_page() or later. This basically assumes the same context. > > because the intention (as I understand it) is to be invoked from within > > the NAPI callback (while softirq is served) and not if BH is just > > disabled due to a lock or so. > > > > It would also make sense to a add WARN_ON_ONCE(!in_serving_softirq()) to > > page_pool_alloc_pages() to spot usage outside of softirq. But this will > > trigger in every driver since the same function is used in the open > > callback to initially setup the HW. > > What about adding a check in the cached allocation path in order to > skip the initial page allocation? Maybe. I'm a bit worried that this lock-less has no lockdep or similar checks if everyone plays by the rules. > Thanks > /Ilias Sebastian
On Wed, 23 Aug 2023 11:47:57 +0200 Sebastian Andrzej Siewior wrote: > The pool can be filled in the same context (within allocation if the > pool is empty). There is also page_pool_recycle_in_cache() which fills > the pool from within skb free, for instance: > napi_consume_skb() -> skb_release_all() -> skb_release_data() -> > napi_frag_unref() -> page_pool_return_skb_page(). > > The last one has the following check here: > | napi = READ_ONCE(pp->p.napi); > | allow_direct = napi_safe && napi && > | READ_ONCE(napi->list_owner) == smp_processor_id(); > > This eventually ends in page_pool_recycle_in_cache() where it adds the > page to the cache buffer if the check above is true (and BH is disabled). > > napi->list_owner is set once NAPI is scheduled until the poll callback > completed. It is safe to add items to list because only one of the two > can run on a single CPU and the completion of them ensured by having BH > disabled the whole time. One clarification - .napi will be NULL for otx2, it's an opt-in for drivers which allocate from NAPI, and AFAICT otx2 does not opt in.
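Jakub's clarification is visible directly in the quoted check: with `pp->p.napi` left NULL (the opt-in not taken, as in otx2), `allow_direct` can never become true, so skb-free recycling falls back to the locked ptr_ring rather than the lock-less cache. A minimal userspace model — only the field names follow the quoted kernel code, the struct layouts are invented for illustration:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct napi_struct {
	int list_owner;		/* polling CPU, or -1 outside the poll loop */
};

struct page_pool_params {
	struct napi_struct *napi;	/* opt-in; NULL for otx2 */
};

struct page_pool {
	struct page_pool_params p;
};

/* Simplified form of the quoted allow_direct computation from
 * page_pool_return_skb_page(). */
static bool allow_direct_recycle(const struct page_pool *pp,
				 bool napi_safe, int this_cpu)
{
	const struct napi_struct *napi = pp->p.napi;

	return napi_safe && napi && napi->list_owner == this_cpu;
}
```

So the recycle side is already fenced off for otx2; the remaining problem discussed in the rest of the thread is the alloc side, which has no equivalent guard.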
(Cc Olek as he have changes in this code path) On 23/08/2023 11.47, Sebastian Andrzej Siewior wrote: > Hi, > > I've been looking at the page_pool locking. > > page_pool_alloc_frag() -> page_pool_alloc_pages() -> > __page_pool_get_cached(): > > There core of the allocation is: > | /* Caller MUST guarantee safe non-concurrent access, e.g. softirq */ > | if (likely(pool->alloc.count)) { > | /* Fast-path */ > | page = pool->alloc.cache[--pool->alloc.count]; > > The access to the `cache' array and the `count' variable is not locked. > This is fine as long as there only one consumer per pool. In my > understanding the intention is to have one page_pool per NAPI callback > to ensure this. > Yes, the intention is a single PP instance is "bound" to one RX-NAPI. > The pool can be filled in the same context (within allocation if the > pool is empty). There is also page_pool_recycle_in_cache() which fills > the pool from within skb free, for instance: > napi_consume_skb() -> skb_release_all() -> skb_release_data() -> > napi_frag_unref() -> page_pool_return_skb_page(). > > The last one has the following check here: > | napi = READ_ONCE(pp->p.napi); > | allow_direct = napi_safe && napi && > | READ_ONCE(napi->list_owner) == smp_processor_id(); > > This eventually ends in page_pool_recycle_in_cache() where it adds the > page to the cache buffer if the check above is true (and BH is disabled). > > napi->list_owner is set once NAPI is scheduled until the poll callback > completed. It is safe to add items to list because only one of the two > can run on a single CPU and the completion of them ensured by having BH > disabled the whole time. > > This breaks in octeontx2 where a worker is used to fill the buffer: > otx2_pool_refill_task() -> otx2_alloc_rbuf() -> __otx2_alloc_rbuf() -> > otx2_alloc_pool_buf() -> page_pool_alloc_frag(). > This seems problematic! - this is NOT allowed. But otx2_pool_refill_task() is a work-queue, and I though it runs in process-context. 
This WQ process is not allowed to use the lockless PP cache. This seems
to be a bug!

The problematic part is otx2_alloc_rbuf() that disables BH:

int otx2_alloc_rbuf(struct otx2_nic *pfvf, struct otx2_pool *pool,
		    dma_addr_t *dma)
{
	int ret;

	local_bh_disable();
	ret = __otx2_alloc_rbuf(pfvf, pool, dma);
	local_bh_enable();
	return ret;
}

The fix can be to not do this local_bh_disable() in this driver?

> BH is disabled but the add of a page can still happen while NAPI
> callback runs on a remote CPU and so corrupting the index/ array.
>
> API wise I would suggest to
>
> diff --git a/net/core/page_pool.c b/net/core/page_pool.c
> index 7ff80b80a6f9f..b50e219470a36 100644
> --- a/net/core/page_pool.c
> +++ b/net/core/page_pool.c
> @@ -612,7 +612,7 @@ __page_pool_put_page(struct page_pool *pool, struct page *page,
>  			page_pool_dma_sync_for_device(pool, page,
>  						      dma_sync_size);
>
> -		if (allow_direct && in_softirq() &&
> +		if (allow_direct && in_serving_softirq() &&

This is the "return/free/put" code path, where we have "allow_direct" as
a protection in the API. API users are supposed to use
page_pool_recycle_direct() to indicate this, but at some point we
allowed APIs to expose 'allow_direct'.

The PP-alloc side is more fragile, and maybe the in_serving_softirq()
belongs there.

>  		    page_pool_recycle_in_cache(page, pool))
>  			return NULL;
>
> because the intention (as I understand it) is to be invoked from within
> the NAPI callback (while softirq is served) and not if BH is just
> disabled due to a lock or so.

True, and it used to be like this (in_serving_softirq), but as Ilias
wrote it was changed recently. This was to support threaded NAPI (in
542bcea4be866b ("net: page_pool: use in_softirq() instead")), which
I understood was one of your (Sebastian's) use-cases.

> It would also make sense to add a WARN_ON_ONCE(!in_serving_softirq()) to
> page_pool_alloc_pages() to spot usage outside of softirq.
But this will > trigger in every driver since the same function is used in the open > callback to initially setup the HW. > I'm very open to ideas of detecting this. Since mentioned commit PP is open to these kind of miss-uses of the API. One idea would be to leverage that NAPI napi->list_owner will have been set to something else than -1, when this is NAPI context. Getting hold of napi object, could be done via pp->p.napi (but as Jakub wrote this is opt-in ATM). --Jesper
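Jesper's detection idea can be phrased as code; this is a purely hypothetical userspace sketch, not a proposed kernel patch (the function name is invented, and the struct layouts only echo the field names discussed in the thread). The rule it encodes: if the pool has opted in to a NAPI instance, the lock-less cache may only be touched while that NAPI's `list_owner` equals the current CPU; without the opt-in there is nothing to verify.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct napi_struct {
	int list_owner;		/* set to the polling CPU, -1 otherwise */
};

struct page_pool_params {
	struct napi_struct *napi;	/* opt-in; may be NULL */
};

struct page_pool {
	struct page_pool_params p;
};

/* Hypothetical sanity check for the alloc-side fast path: returns
 * false when the lock-less cache is being used from the wrong context,
 * i.e. the bound NAPI is not currently polling on this CPU. */
static bool pp_cache_access_ok(const struct page_pool *pp, int this_cpu)
{
	const struct napi_struct *napi = pp->p.napi;

	if (!napi)
		return true;	/* opt-in not taken: cannot verify */
	return napi->list_owner == this_cpu;
}
```

In the kernel, something like this would sit behind a WARN_ON_ONCE() in the cached allocation path; as the thread notes, the ->ndo_open() pre-fill (when list_owner is still -1) would also trip it, so that case would need to be carved out.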
> From: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> Sent: Wednesday, August 23, 2023 6:25 PM
> To: Ratheesh Kannoth <rkannoth@marvell.com>
> Subject: Re: RE: [EXT] [BUG] Possible unsafe page_pool usage in octeontx2
>
> I would suggest to stay away from the lock-less buffer if not in NAPI and feed
> the pool->ring instead.

As Jakub explained, allow_direct will be false as pp->p.napi is 0, so
there is no lockless addition. I think we don't have to fix the page
pool alloc() in workqueue issue.

-Ratheesh
[...] > > > > diff --git a/net/core/page_pool.c b/net/core/page_pool.c > > index 7ff80b80a6f9f..b50e219470a36 100644 > > --- a/net/core/page_pool.c > > +++ b/net/core/page_pool.c > > @@ -612,7 +612,7 @@ __page_pool_put_page(struct page_pool *pool, struct page *page, > > page_pool_dma_sync_for_device(pool, page, > > dma_sync_size); > > > > - if (allow_direct && in_softirq() && > > + if (allow_direct && in_serving_softirq() && > > This is the "return/free/put" code path, where we have "allow_direct" as > a protection in the API. API users are suppose to use > page_pool_recycle_direct() to indicate this, but as some point we > allowed APIs to expose 'allow_direct'. > > The PP-alloc side is more fragile, and maybe the in_serving_softirq() > belongs there. > > > page_pool_recycle_in_cache(page, pool)) > > return NULL; > > > > because the intention (as I understand it) is to be invoked from within > > the NAPI callback (while softirq is served) and not if BH is just > > disabled due to a lock or so. > > > > True, and it used-to-be like this (in_serving_softirq), but as Ilias > wrote it was changed recently. This was to support threaded-NAPI (in > 542bcea4be866b ("net: page_pool: use in_softirq() instead")), which > I understood was one of your (Sebastian's) use-cases. > > > > It would also make sense to a add WARN_ON_ONCE(!in_serving_softirq()) to > > page_pool_alloc_pages() to spot usage outside of softirq. But this will > > trigger in every driver since the same function is used in the open > > callback to initially setup the HW. > > > > I'm very open to ideas of detecting this. Since mentioned commit PP is > open to these kind of miss-uses of the API. > > One idea would be to leverage that NAPI napi->list_owner will have been > set to something else than -1, when this is NAPI context. Getting hold > of napi object, could be done via pp->p.napi (but as Jakub wrote this is > opt-in ATM). 
I mentioned this earlier, but can't we add the softirq check in __page_pool_get_cached()? In theory, when a driver comes up and allocates pages to fill in its descriptors it will call page_pool_alloc_pages(). That will go through the slow allocation path, fill up the caches, and return the last page. After that, most of the allocations will be served by __page_pool_get_cached(), and this is supposed to be running during the driver Rx routine which runs under NAPI. So eventually we will hit that warning. Thanks /Ilias > > --Jesper
On Thu, 24 Aug 2023 at 10:21, Ilias Apalodimas <ilias.apalodimas@linaro.org> wrote: > > [...] > > > > > > > diff --git a/net/core/page_pool.c b/net/core/page_pool.c > > > index 7ff80b80a6f9f..b50e219470a36 100644 > > > --- a/net/core/page_pool.c > > > +++ b/net/core/page_pool.c > > > @@ -612,7 +612,7 @@ __page_pool_put_page(struct page_pool *pool, struct page *page, > > > page_pool_dma_sync_for_device(pool, page, > > > dma_sync_size); > > > > > > - if (allow_direct && in_softirq() && > > > + if (allow_direct && in_serving_softirq() && > > > > This is the "return/free/put" code path, where we have "allow_direct" as > > a protection in the API. API users are suppose to use > > page_pool_recycle_direct() to indicate this, but as some point we > > allowed APIs to expose 'allow_direct'. > > > > The PP-alloc side is more fragile, and maybe the in_serving_softirq() > > belongs there. > > > > > page_pool_recycle_in_cache(page, pool)) > > > return NULL; > > > > > > because the intention (as I understand it) is to be invoked from within > > > the NAPI callback (while softirq is served) and not if BH is just > > > disabled due to a lock or so. > > > > > > > True, and it used-to-be like this (in_serving_softirq), but as Ilias > > wrote it was changed recently. This was to support threaded-NAPI (in > > 542bcea4be866b ("net: page_pool: use in_softirq() instead")), which > > I understood was one of your (Sebastian's) use-cases. > > > > > > > It would also make sense to a add WARN_ON_ONCE(!in_serving_softirq()) to > > > page_pool_alloc_pages() to spot usage outside of softirq. But this will > > > trigger in every driver since the same function is used in the open > > > callback to initially setup the HW. > > > > > > > I'm very open to ideas of detecting this. Since mentioned commit PP is > > open to these kind of miss-uses of the API. > > > > One idea would be to leverage that NAPI napi->list_owner will have been > > set to something else than -1, when this is NAPI context. 
Getting hold > > of napi object, could be done via pp->p.napi (but as Jakub wrote this is > > opt-in ATM). > > I mentioned this earlier, but can't we add the softirq check in > __page_pool_get_cached()? > In theory, when a driver comes up and allocates pages to fill in its > descriptors it will call page_pool_alloc_pages(). That will go > through the slow allocation path, fill up the caches, and return the > last page. After that, most of the allocations will be served by > __page_pool_get_cached(), and this is supposed to be running during > the driver Rx routine which runs under NAPI. So eventually we will > hit that warning. Right... Scratch that, this will still warn on the initial allocation. The first descriptor will get a page of the slow path, but the rest will be filled via the caches. /Ilias > > Thanks > /Ilias > > > > > --Jesper
From: Jesper Dangaard Brouer <hawk@kernel.org> Date: Wed, 23 Aug 2023 21:45:04 +0200 > (Cc Olek as he have changes in this code path) Thanks! I was reading the thread a bit on LKML, but being in the CC list is more convenient :D > > On 23/08/2023 11.47, Sebastian Andrzej Siewior wrote: >> Hi, >> >> I've been looking at the page_pool locking. >> >> page_pool_alloc_frag() -> page_pool_alloc_pages() -> >> __page_pool_get_cached(): >> >> There core of the allocation is: >> | /* Caller MUST guarantee safe non-concurrent access, e.g. >> softirq */ >> | if (likely(pool->alloc.count)) { >> | /* Fast-path */ >> | page = pool->alloc.cache[--pool->alloc.count]; >> >> The access to the `cache' array and the `count' variable is not locked. >> This is fine as long as there only one consumer per pool. In my >> understanding the intention is to have one page_pool per NAPI callback >> to ensure this. >> > > Yes, the intention is a single PP instance is "bound" to one RX-NAPI. Isn't that also a misuse of page_pool->p.napi? I thought it can be set only when page allocation and cache refill happen both inside the same NAPI polling function. Otx2 uses workqueues to refill the queues, meaning that consumer and producer can happen in different contexts or even threads and it shouldn't set p.napi. > > >> The pool can be filled in the same context (within allocation if the >> pool is empty). There is also page_pool_recycle_in_cache() which fills >> the pool from within skb free, for instance: >> napi_consume_skb() -> skb_release_all() -> skb_release_data() -> >> napi_frag_unref() -> page_pool_return_skb_page(). >> >> The last one has the following check here: >> | napi = READ_ONCE(pp->p.napi); >> | allow_direct = napi_safe && napi && >> | READ_ONCE(napi->list_owner) == smp_processor_id(); >> >> This eventually ends in page_pool_recycle_in_cache() where it adds the >> page to the cache buffer if the check above is true (and BH is disabled). 
>> >> napi->list_owner is set once NAPI is scheduled until the poll callback >> completed. It is safe to add items to list because only one of the two >> can run on a single CPU and the completion of them ensured by having BH >> disabled the whole time. >> >> This breaks in octeontx2 where a worker is used to fill the buffer: >> otx2_pool_refill_task() -> otx2_alloc_rbuf() -> __otx2_alloc_rbuf() -> >> otx2_alloc_pool_buf() -> page_pool_alloc_frag(). >> > > This seems problematic! - this is NOT allowed. > > But otx2_pool_refill_task() is a work-queue, and I though it runs in > process-context. This WQ process is not allowed to use the lockless PP > cache. This seems to be a bug! > > The problematic part is otx2_alloc_rbuf() that disables BH: > > int otx2_alloc_rbuf(struct otx2_nic *pfvf, struct otx2_pool *pool, > dma_addr_t *dma) > { > int ret; > > local_bh_disable(); > ret = __otx2_alloc_rbuf(pfvf, pool, dma); > local_bh_enable(); > return ret; > } > > The fix, can be to not do this local_bh_disable() in this driver? > >> BH is disabled but the add of a page can still happen while NAPI >> callback runs on a remote CPU and so corrupting the index/ array. >> >> API wise I would suggest to >> >> diff --git a/net/core/page_pool.c b/net/core/page_pool.c >> index 7ff80b80a6f9f..b50e219470a36 100644 >> --- a/net/core/page_pool.c >> +++ b/net/core/page_pool.c >> @@ -612,7 +612,7 @@ __page_pool_put_page(struct page_pool *pool, >> struct page *page, >> page_pool_dma_sync_for_device(pool, page, >> dma_sync_size); >> - if (allow_direct && in_softirq() && >> + if (allow_direct && in_serving_softirq() && > > This is the "return/free/put" code path, where we have "allow_direct" as > a protection in the API. API users are suppose to use > page_pool_recycle_direct() to indicate this, but as some point we > allowed APIs to expose 'allow_direct'. > > The PP-alloc side is more fragile, and maybe the in_serving_softirq() > belongs there. 
> >> page_pool_recycle_in_cache(page, pool)) >> return NULL; >> because the intention (as I understand it) is to be invoked from within >> the NAPI callback (while softirq is served) and not if BH is just >> disabled due to a lock or so. >> > > True, and it used-to-be like this (in_serving_softirq), but as Ilias > wrote it was changed recently. This was to support threaded-NAPI (in > 542bcea4be866b ("net: page_pool: use in_softirq() instead")), which > I understood was one of your (Sebastian's) use-cases. > > >> It would also make sense to a add WARN_ON_ONCE(!in_serving_softirq()) to >> page_pool_alloc_pages() to spot usage outside of softirq. But this will >> trigger in every driver since the same function is used in the open >> callback to initially setup the HW. >> > > I'm very open to ideas of detecting this. Since mentioned commit PP is > open to these kind of miss-uses of the API. > > One idea would be to leverage that NAPI napi->list_owner will have been > set to something else than -1, when this is NAPI context. Getting hold > of napi object, could be done via pp->p.napi (but as Jakub wrote this is > opt-in ATM). > > --Jesper Thanks, Olek
On 23/08/2023 21.45, Jesper Dangaard Brouer wrote:
> On 23/08/2023 11.47, Sebastian Andrzej Siewior wrote:
[...]
>>
>> This breaks in octeontx2 where a worker is used to fill the buffer:
>> otx2_pool_refill_task() -> otx2_alloc_rbuf() -> __otx2_alloc_rbuf() ->
>> otx2_alloc_pool_buf() -> page_pool_alloc_frag().
>>
>
> This seems problematic! - this is NOT allowed.
>
> But otx2_pool_refill_task() is a work-queue, and I thought it runs in
> process-context. This WQ process is not allowed to use the lockless PP
> cache. This seems to be a bug!
>
> The problematic part is otx2_alloc_rbuf() that disables BH:
>
> int otx2_alloc_rbuf(struct otx2_nic *pfvf, struct otx2_pool *pool,
> 		    dma_addr_t *dma)
> {
> 	int ret;
>
> 	local_bh_disable();
> 	ret = __otx2_alloc_rbuf(pfvf, pool, dma);
> 	local_bh_enable();
> 	return ret;
> }
>
> The fix, can be to not do this local_bh_disable() in this driver?

Correcting myself. It is not a fix to remove this local_bh_disable()
(which is obvious now that I read the code again).

This WQ process is not allowed to use the page_pool_alloc() API this
way (from a work-queue). The PP alloc-side API must only be used under
NAPI protection.

Thanks for spotting this, Sebastian!

Will/can any of the Cc'ed Marvell people work on a fix?

--Jesper
On 24/08/2023 17.26, Alexander Lobakin wrote: > From: Jesper Dangaard Brouer<hawk@kernel.org> > Date: Wed, 23 Aug 2023 21:45:04 +0200 > >> (Cc Olek as he have changes in this code path) > Thanks! I was reading the thread a bit on LKML, but being in the CC list > is more convenient :D > :D >> On 23/08/2023 11.47, Sebastian Andrzej Siewior wrote: >>> Hi, >>> >>> I've been looking at the page_pool locking. >>> >>> page_pool_alloc_frag() -> page_pool_alloc_pages() -> >>> __page_pool_get_cached(): >>> >>> There core of the allocation is: >>> | /* Caller MUST guarantee safe non-concurrent access, e.g. >>> softirq */ >>> | if (likely(pool->alloc.count)) { >>> | /* Fast-path */ >>> | page = pool->alloc.cache[--pool->alloc.count]; >>> >>> The access to the `cache' array and the `count' variable is not locked. >>> This is fine as long as there only one consumer per pool. In my >>> understanding the intention is to have one page_pool per NAPI callback >>> to ensure this. >>> >> Yes, the intention is a single PP instance is "bound" to one RX-NAPI. > > Isn't that also a misuse of page_pool->p.napi? I thought it can be set > only when page allocation and cache refill happen both inside the same > NAPI polling function. Otx2 uses workqueues to refill the queues, > meaning that consumer and producer can happen in different contexts or > even threads and it shouldn't set p.napi. > As Jakub wrote this otx2 driver doesn't set p.napi (so that part of the problem cannot happen). That said using workqueues to refill the queues is not compatible with using page_pool_alloc APIs. This need to be fixed in driver! --Jesper
From: Jesper Dangaard Brouer <hawk@kernel.org> Date: Fri, 25 Aug 2023 15:22:05 +0200 > > > On 24/08/2023 17.26, Alexander Lobakin wrote: >> From: Jesper Dangaard Brouer<hawk@kernel.org> >> Date: Wed, 23 Aug 2023 21:45:04 +0200 >> >>> (Cc Olek as he have changes in this code path) >> Thanks! I was reading the thread a bit on LKML, but being in the CC list >> is more convenient :D >> > > :D > >>> On 23/08/2023 11.47, Sebastian Andrzej Siewior wrote: >>>> Hi, >>>> >>>> I've been looking at the page_pool locking. >>>> >>>> page_pool_alloc_frag() -> page_pool_alloc_pages() -> >>>> __page_pool_get_cached(): >>>> >>>> There core of the allocation is: >>>> | /* Caller MUST guarantee safe non-concurrent access, e.g. >>>> softirq */ >>>> | if (likely(pool->alloc.count)) { >>>> | /* Fast-path */ >>>> | page = pool->alloc.cache[--pool->alloc.count]; >>>> >>>> The access to the `cache' array and the `count' variable is not locked. >>>> This is fine as long as there only one consumer per pool. In my >>>> understanding the intention is to have one page_pool per NAPI callback >>>> to ensure this. >>>> >>> Yes, the intention is a single PP instance is "bound" to one RX-NAPI. >> >> Isn't that also a misuse of page_pool->p.napi? I thought it can be set >> only when page allocation and cache refill happen both inside the same >> NAPI polling function. Otx2 uses workqueues to refill the queues, >> meaning that consumer and producer can happen in different contexts or >> even threads and it shouldn't set p.napi. >> > > As Jakub wrote this otx2 driver doesn't set p.napi (so that part of the > problem cannot happen). Oh I'm dumb :z > > That said using workqueues to refill the queues is not compatible with > using page_pool_alloc APIs. This need to be fixed in driver! Hmmm I'm wondering now, how should we use Page Pool if we want to run consume and alloc routines separately? Do you want to say we can't use Page Pool in that case at all? 
Quoting other your reply: > This WQ process is not allowed to use the page_pool_alloc() API this > way (from a work-queue). The PP alloc-side API must only be used > under NAPI protection. Who did say that? If I don't set p.napi, how is Page Pool then tied to NAPI? Moreover, when you initially do an ifup/.ndo_open, you have to fill your Rx queues. It's process context and it can happen on whichever CPU. Do you mean I can't allocate pages in .ndo_open? :D > > --Jesper Thanks, Olek
On 25/08/2023 15.38, Alexander Lobakin wrote: > From: Jesper Dangaard Brouer <hawk@kernel.org> > Date: Fri, 25 Aug 2023 15:22:05 +0200 > >> >> >> On 24/08/2023 17.26, Alexander Lobakin wrote: >>> From: Jesper Dangaard Brouer<hawk@kernel.org> >>> Date: Wed, 23 Aug 2023 21:45:04 +0200 >>> >>>> (Cc Olek as he have changes in this code path) >>> Thanks! I was reading the thread a bit on LKML, but being in the CC list >>> is more convenient :D >>> >> >> :D >> >>>> On 23/08/2023 11.47, Sebastian Andrzej Siewior wrote: >>>>> Hi, >>>>> >>>>> I've been looking at the page_pool locking. >>>>> >>>>> page_pool_alloc_frag() -> page_pool_alloc_pages() -> >>>>> __page_pool_get_cached(): >>>>> >>>>> There core of the allocation is: >>>>> | /* Caller MUST guarantee safe non-concurrent access, e.g. >>>>> softirq */ >>>>> | if (likely(pool->alloc.count)) { >>>>> | /* Fast-path */ >>>>> | page = pool->alloc.cache[--pool->alloc.count]; >>>>> >>>>> The access to the `cache' array and the `count' variable is not locked. >>>>> This is fine as long as there only one consumer per pool. In my >>>>> understanding the intention is to have one page_pool per NAPI callback >>>>> to ensure this. >>>>> >>>> Yes, the intention is a single PP instance is "bound" to one RX-NAPI. >>> >>> Isn't that also a misuse of page_pool->p.napi? I thought it can be set >>> only when page allocation and cache refill happen both inside the same >>> NAPI polling function. Otx2 uses workqueues to refill the queues, >>> meaning that consumer and producer can happen in different contexts or >>> even threads and it shouldn't set p.napi. >>> >> >> As Jakub wrote this otx2 driver doesn't set p.napi (so that part of the >> problem cannot happen). > > Oh I'm dumb :z > >> >> That said using workqueues to refill the queues is not compatible with >> using page_pool_alloc APIs. This need to be fixed in driver! > > Hmmm I'm wondering now, how should we use Page Pool if we want to run > consume and alloc routines separately? 
> Do you want to say we can't use Page Pool in that case at all?
> Quoting other your reply:
>
>> This WQ process is not allowed to use the page_pool_alloc() API this
>> way (from a work-queue). The PP alloc-side API must only be used
>> under NAPI protection.
>
> Who did say that? If I don't set p.napi, how is Page Pool then tied to NAPI?

*I* say that (as the PP inventor): that was the design and intent. The
pool is tied to a NAPI instance and relies on the NAPI protection to
make lockless access to this cache array safe.

> Moreover, when you initially do an ifup/.ndo_open, you have to fill your
> Rx queues. It's process context and it can happen on whichever CPU.
> Do you mean I can't allocate pages in .ndo_open? :D

True, all drivers basically allocate from this *before* the RX-ring /
NAPI is activated. That is safe and "allowed" given the driver RX-ring
is not active yet. This use-case unfortunately also makes it harder to
add something to the PP API that detects misuse of the API.

Looking again at the driver: the otx2_txrx.c NAPI code path also calls
PP directly. otx2_napi_handler() calls refill_pool_ptrs() ->
otx2_refill_pool_ptrs() -> otx2_alloc_buffer() -> __otx2_alloc_rbuf() ->
if (pool->page_pool) otx2_alloc_pool_buf() -> page_pool_alloc_frag().

The function otx2_alloc_buffer() can also choose to start a WQ, which
also calls the PP alloc API, this time via otx2_alloc_rbuf() with
BH disabled. Like Sebastian, I don't think this is safe!
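[Editorial note] The single-consumer constraint being discussed can be sketched in plain C. This is a minimal userspace model, not the kernel code: the names (`alloc_cache`, `cache_get`, `cache_put`) are invented for illustration, and it only shows why `count` and the array must never be touched from two contexts at once.

```c
#include <stddef.h>

/* Userspace model of the page_pool lockless alloc cache; names are
 * illustrative, not the real kernel structures. */
#define PP_ALLOC_CACHE_SIZE 128

struct alloc_cache {
	unsigned int count;
	void *cache[PP_ALLOC_CACHE_SIZE];
};

/* Fast-path alloc: caller MUST guarantee non-concurrent access (in the
 * kernel, the pool's own NAPI poll with BH disabled). */
static void *cache_get(struct alloc_cache *c)
{
	if (c->count)
		return c->cache[--c->count];
	return NULL;	/* kernel would fall back to ptr_ring / page allocator */
}

/* Fast-path recycle: same single-context requirement. A refill worker
 * running this on another CPU does an unlocked read-modify-write of
 * `count` racing the NAPI poller, which is exactly the index/array
 * corruption reported for otx2. */
static int cache_put(struct alloc_cache *c, void *page)
{
	if (c->count < PP_ALLOC_CACHE_SIZE) {
		c->cache[c->count++] = page;
		return 0;
	}
	return -1;	/* cache full: kernel recycles into ptr_ring instead */
}
```

Both helpers touch `count` without any lock; that is only safe because, by design, one pool is consumed from exactly one NAPI instance.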
--Jesper

This can be a workaround fix:

$ git diff
diff --git a/drivers/net/ethernet/marvell/octeontx2/nic/otx2_common.c b/drivers/net/ethernet/marvell/octeontx2/nic/otx2_common.c
index dce3cea00032..ab7ca146fddf 100644
--- a/drivers/net/ethernet/marvell/octeontx2/nic/otx2_common.c
+++ b/drivers/net/ethernet/marvell/octeontx2/nic/otx2_common.c
@@ -578,6 +578,10 @@ int otx2_alloc_buffer(struct otx2_nic *pfvf, struct otx2_cq_queue *cq,
 	struct refill_work *work;
 	struct delayed_work *dwork;
 
+	/* page_pool alloc API cannot be used from WQ */
+	if (cq->rbpool->page_pool)
+		return -ENOMEM;
+
 	work = &pfvf->refill_wrk[cq->cq_idx];
 	dwork = &work->pool_refill_work;
 	/* Schedule a task if no other task is running */
On Fri, 25 Aug 2023 19:25:42 +0200 Jesper Dangaard Brouer wrote:
> >> This WQ process is not allowed to use the page_pool_alloc() API this
> >> way (from a work-queue). The PP alloc-side API must only be used
> >> under NAPI protection.
> >
> > Who did say that? If I don't set p.napi, how is Page Pool then tied to NAPI?
>
> *I* say that (as the PP inventor) as that was the design and intent,
> that this is tied to a NAPI instance and rely on the NAPI protection to
> make it safe to do lockless access to this cache array.

Absolutely no objection to us making the NAPI / bh context a requirement
past the startup stage, but just to be sure I understand the code -
technically, if the driver never recycles direct, does not set the NAPI,
does not use xdp_return_frame_rx_napi() etc. - the cache is always empty,
so we good?

I wonder if we can add a check like "mark the pool as BH-only on first
BH use, and WARN() on process use afterwards". But I'm not sure what
CONFIG you'd accept that being under ;)
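[Editorial note] Jakub's proposed debug check could look roughly like the following userspace sketch. Everything here (`pp_dbg`, `pp_ctx_check`, the enum) is invented for illustration; a real kernel version would key off the actual context (e.g. in_serving_softirq()) and use WARN_ONCE(), likely under a debug CONFIG option.

```c
#include <stdio.h>

enum pp_ctx { PP_CTX_PROCESS, PP_CTX_BH };

struct pp_dbg {
	int bh_seen;	/* pool has been used from BH context at least once */
	int warned;	/* WARN_ONCE() equivalent */
};

/* Returns 1 when a BH-only pool is used from process context.  The
 * startup-stage fill from .ndo_open stays allowed because it happens
 * before any BH use has marked the pool. */
static int pp_ctx_check(struct pp_dbg *dbg, enum pp_ctx now)
{
	if (now == PP_CTX_BH) {
		dbg->bh_seen = 1;	/* first BH use marks the pool BH-only */
		return 0;
	}
	if (dbg->bh_seen) {
		if (!dbg->warned) {
			dbg->warned = 1;
			fprintf(stderr,
				"page_pool: process-context use of BH-only pool\n");
		}
		return 1;
	}
	return 0;
}
```

This also matches the thread's observation that drivers legitimately fill rings from process context before the first NAPI poll: only use *after* the first BH use is flagged.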
From: Jakub Kicinski <kuba@kernel.org>
Date: Fri, 25 Aug 2023 17:42:58 -0700

> On Fri, 25 Aug 2023 19:25:42 +0200 Jesper Dangaard Brouer wrote:
>>>> This WQ process is not allowed to use the page_pool_alloc() API this
>>>> way (from a work-queue). The PP alloc-side API must only be used
>>>> under NAPI protection.
>>>
>>> Who did say that? If I don't set p.napi, how is Page Pool then tied to NAPI?
>>
>> *I* say that (as the PP inventor) as that was the design and intent,
>> that this is tied to a NAPI instance and rely on the NAPI protection to
>> make it safe to do lockless access to this cache array.
>
> Absolutely no objection to us making the NAPI / bh context a requirement
> past the startup stage, but just to be sure I understand the code -
> technically if the driver never recycles direct, does not set the NAPI,
> does not use xdp_return_frame_rx_napi etc. - the cache is always empty
> so we good?

+1, I don't say Otx2 is correct, but I don't see any issues in having
consumer and producer running on different cores and in different
contexts, as long as p.napi is not set. The split queue model is
trending and I don't see reasons why PP should require serializing ->
effectively making it work the same way as the "single queue" mode
works. Especially given that we have the ptr_ring with locks, not only
the direct cache.

> I wonder if we can add a check like "mark the pool as BH-only on first
> BH use, and WARN() on process use afterwards". But I'm not sure what

Why do we use spin_lock_bh() and friends then, and check in_softirq(),
if PP can only work in one context?

> CONFIG you'd accept that being under ;)

Thanks,
Olek
From: Jesper Dangaard Brouer <hawk@kernel.org>
Date: Fri, 25 Aug 2023 19:25:42 +0200

>
>
> On 25/08/2023 15.38, Alexander Lobakin wrote:
>> From: Jesper Dangaard Brouer <hawk@kernel.org>
>> Date: Fri, 25 Aug 2023 15:22:05 +0200
>>
>>>
>>>
>>> On 24/08/2023 17.26, Alexander Lobakin wrote:
>>>> From: Jesper Dangaard Brouer <hawk@kernel.org>
>>>> Date: Wed, 23 Aug 2023 21:45:04 +0200
>>>>
>>>>> (Cc Olek as he have changes in this code path)
>>>> Thanks! I was reading the thread a bit on LKML, but being in the CC list
>>>> is more convenient :D
>>>>
>>>
>>> :D
>>>
>>>>> On 23/08/2023 11.47, Sebastian Andrzej Siewior wrote:
>>>>>> Hi,
>>>>>>
>>>>>> I've been looking at the page_pool locking.
>>>>>>
>>>>>> page_pool_alloc_frag() -> page_pool_alloc_pages() ->
>>>>>> __page_pool_get_cached():
>>>>>>
>>>>>> There core of the allocation is:
>>>>>> | /* Caller MUST guarantee safe non-concurrent access, e.g. softirq */
>>>>>> | if (likely(pool->alloc.count)) {
>>>>>> |         /* Fast-path */
>>>>>> |         page = pool->alloc.cache[--pool->alloc.count];
>>>>>>
>>>>>> The access to the `cache' array and the `count' variable is not
>>>>>> locked.
>>>>>> This is fine as long as there only one consumer per pool. In my
>>>>>> understanding the intention is to have one page_pool per NAPI
>>>>>> callback
>>>>>> to ensure this.
>>>>>>
>>>>> Yes, the intention is a single PP instance is "bound" to one RX-NAPI.
>>>>
>>>> Isn't that also a misuse of page_pool->p.napi? I thought it can be set
>>>> only when page allocation and cache refill happen both inside the same
>>>> NAPI polling function. Otx2 uses workqueues to refill the queues,
>>>> meaning that consumer and producer can happen in different contexts or
>>>> even threads and it shouldn't set p.napi.
>>>>
>>>
>>> As Jakub wrote this otx2 driver doesn't set p.napi (so that part of the
>>> problem cannot happen).
>>
>> Oh I'm dumb :z
>>
>>>
>>> That said using workqueues to refill the queues is not compatible with
>>> using page_pool_alloc APIs. This need to be fixed in driver!
>>
>> Hmmm I'm wondering now, how should we use Page Pool if we want to run
>> consume and alloc routines separately? Do you want to say we can't use
>> Page Pool in that case at all? Quoting other your reply:
>>
>>> This WQ process is not allowed to use the page_pool_alloc() API this
>>> way (from a work-queue). The PP alloc-side API must only be used
>>> under NAPI protection.
>>
>> Who did say that? If I don't set p.napi, how is Page Pool then tied to
>> NAPI?
>
> *I* say that (as the PP inventor) as that was the design and intent,

> *I*
> the PP inventor

Sure, I got that a couple of years ago, no need to keep reminding me of
that xD

> that this is tied to a NAPI instance and rely on the NAPI protection to
> make it safe to do lockless access to this cache array.
>
>> Moreover, when you initially do an ifup/.ndo_open, you have to fill your
>> Rx queues. It's process context and it can happen on whichever CPU.
>> Do you mean I can't allocate pages in .ndo_open? :D
>
> True, that all driver basically allocate from this *before* the RX-ring
> / NAPI is activated. That is safe and "allowed" given the driver
> RX-ring is not active yet. This use-case unfortunately also make it
> harder to add something to the PP API, that detect miss-usage of the API.
>
> Looking again at the driver otx2_txrx.c NAPI code path also calls PP
> directly in otx2_napi_handler() will call refill_pool_ptrs() ->
> otx2_refill_pool_ptrs() -> otx2_alloc_buffer() -> __otx2_alloc_rbuf() ->
> if (pool->page_pool) otx2_alloc_pool_buf() -> page_pool_alloc_frag().
>
> The function otx2_alloc_buffer() can also choose to start a WQ, that
> also called PP alloc API this time via otx2_alloc_rbuf() that have
> BH-disable. Like Sebastian, I don't think this is safe!
Disabling BH doesn't look correct to me, but I don't see issues in
having consumer and producer working on different cores, as long as
they use the ptr_ring with locks.

>
> --Jesper
>
> This can be a workaround fix:
>
> $ git diff
> diff --git a/drivers/net/ethernet/marvell/octeontx2/nic/otx2_common.c
> b/drivers/net/ethernet/marvell/octeontx2/nic/otx2_common.c
> index dce3cea00032..ab7ca146fddf 100644
> --- a/drivers/net/ethernet/marvell/octeontx2/nic/otx2_common.c
> +++ b/drivers/net/ethernet/marvell/octeontx2/nic/otx2_common.c
> @@ -578,6 +578,10 @@ int otx2_alloc_buffer(struct otx2_nic *pfvf, struct
> otx2_cq_queue *cq,
>          struct refill_work *work;
>          struct delayed_work *dwork;
>
> +        /* page_pool alloc API cannot be used from WQ */
> +        if (cq->rbpool->page_pool)
> +                return -ENOMEM;

I believe that breaks the driver?

> +
>          work = &pfvf->refill_wrk[cq->cq_idx];
>          dwork = &work->pool_refill_work;
>          /* Schedule a task if no other task is running */

Thanks,
Olek
On 26/08/2023 02.42, Jakub Kicinski wrote:
> On Fri, 25 Aug 2023 19:25:42 +0200 Jesper Dangaard Brouer wrote:
>>>> This WQ process is not allowed to use the page_pool_alloc() API this
>>>> way (from a work-queue). The PP alloc-side API must only be used
>>>> under NAPI protection.
>>>
>>> Who did say that? If I don't set p.napi, how is Page Pool then tied to NAPI?
>>
>> *I* say that (as the PP inventor) as that was the design and intent,
>> that this is tied to a NAPI instance and rely on the NAPI protection to
>> make it safe to do lockless access to this cache array.
>
> Absolutely no objection to us making the NAPI / bh context a requirement
> past the startup stage, but just to be sure I understand the code -
> technically if the driver never recycles direct, does not set the NAPI,
> does not use xdp_return_frame_rx_napi etc. - the cache is always empty
> so we good?
>

Nope, the cache is NOT always empty: the PP cache is refilled when it
runs empty. Thus, the PP alloc-side code will touch/use
pool->alloc.cache[]. See the two places in the code with the comment:
/* Return last page */. The PP cache is always refilled, either from
the ptr_ring or via the page allocator's bulking APIs.

> I wonder if we can add a check like "mark the pool as BH-only on first
> BH use, and WARN() on process use afterwards". But I'm not sure what
> CONFIG you'd accept that being under ;)
>

The PP alloc side is designed as a Single Consumer data-structure/model
(via a lockless cache/array) that, on an empty cache, falls back to
bulking from a Multi Consumer data-structure/model to amortize that
cost. This is where the PP speedup comes from.

--Jesper
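[Editorial note] The two-level design Jesper describes (lockless single-consumer cache, bulk-refilled from a locked structure) can be modeled like this. All names and sizes are invented for illustration, and a plain pthread mutex stands in for the ptr_ring's locking:

```c
#include <pthread.h>
#include <stddef.h>

#define CACHE_SZ 8
#define BULK     4	/* refill batch size; amortizes the lock cost */

struct ring {
	pthread_mutex_t lock;	/* stands in for ptr_ring locking */
	void *slot[64];
	unsigned int n;
};

struct pool {
	unsigned int count;	/* single consumer: no lock needed */
	void *cache[CACHE_SZ];
	struct ring ring;
};

/* Alloc fast path: lockless pop from the cache; on empty, bulk-refill
 * under the ring lock so the lock is taken once per BULK allocations
 * instead of once per allocation. */
static void *pool_get(struct pool *p)
{
	if (!p->count) {
		pthread_mutex_lock(&p->ring.lock);
		while (p->ring.n && p->count < BULK)
			p->cache[p->count++] = p->ring.slot[--p->ring.n];
		pthread_mutex_unlock(&p->ring.lock);
	}
	/* NULL means: go to the (slow) page allocator */
	return p->count ? p->cache[--p->count] : NULL;
}
```

The fast path never takes the lock, which is exactly why it must have a single consumer; the lock only protects the refill source.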
On 28/08/2023 13.07, Alexander Lobakin wrote:
>> This can be a workaround fix:
>>
>> $ git diff
>> diff --git a/drivers/net/ethernet/marvell/octeontx2/nic/otx2_common.c
>> b/drivers/net/ethernet/marvell/octeontx2/nic/otx2_common.c
>> index dce3cea00032..ab7ca146fddf 100644
>> --- a/drivers/net/ethernet/marvell/octeontx2/nic/otx2_common.c
>> +++ b/drivers/net/ethernet/marvell/octeontx2/nic/otx2_common.c
>> @@ -578,6 +578,10 @@ int otx2_alloc_buffer(struct otx2_nic *pfvf, struct
>> otx2_cq_queue *cq,
>>          struct refill_work *work;
>>          struct delayed_work *dwork;
>>
>> +        /* page_pool alloc API cannot be used from WQ */
>> +        if (cq->rbpool->page_pool)
>> +                return -ENOMEM;
> I believe that breaks the driver?
>

Why would that break the driver? AFAIK returning 0 here will break the
driver. We need to return something non-zero, see otx2_refill_pool_ptrs()
copy-pasted below my signature.

>> +
>>          work = &pfvf->refill_wrk[cq->cq_idx];
>>          dwork = &work->pool_refill_work;
>>          /* Schedule a task if no other task is running */

--Jesper

void otx2_refill_pool_ptrs(void *dev, struct otx2_cq_queue *cq)
{
	struct otx2_nic *pfvf = dev;
	dma_addr_t bufptr;

	while (cq->pool_ptrs) {
		if (otx2_alloc_buffer(pfvf, cq, &bufptr))
			break;
		otx2_aura_freeptr(pfvf, cq->cq_idx,
				  bufptr + OTX2_HEAD_ROOM);
		cq->pool_ptrs--;
	}
}
On 2023-08-28 13:07:12 [+0200], Alexander Lobakin wrote:
> > Looking again at the driver otx2_txrx.c NAPI code path also calls PP
> > directly in otx2_napi_handler() will call refill_pool_ptrs() ->
> > otx2_refill_pool_ptrs() -> otx2_alloc_buffer() -> __otx2_alloc_rbuf() ->
> > if (pool->page_pool) otx2_alloc_pool_buf() -> page_pool_alloc_frag().
> >
> > The function otx2_alloc_buffer() can also choose to start a WQ, that
> > also called PP alloc API this time via otx2_alloc_rbuf() that have
> > BH-disable. Like Sebastian, I don't think this is safe!
>
> Disabling BH doesn't look correct to me, but I don't see issues in
> having consumer and producer working on different cores, as long as they
> use ptr_ring with locks.

After learning what p.napi is about, may I point out that there are also
users that don't check that and use page_pool_put_full_page() or later?
While I can't comment on the bpf/XDP users, browsing otx2, there is
this:

| otx2_stop()
| -> otx2_free_hw_resources()
|    -> otx2_cleanup_rx_cqes
|       -> otx2_free_bufs
|          -> page_pool_put_full_page(, true)
| -> cancel_delayed_work_sync()

otx2 is "safe" here due to the in_softirq() check in
__page_pool_put_page(). But: the worker could run and fill the lockless
buffer while otx2_free_bufs() is doing cleanup / removing pages and
filling this buffer on another CPU. The worker is only synchronised
after the free. The lack of BH-disable in otx2_stop()'s path saves the
day here.

(It looks somehow suspicious that there is first "free mem" followed by
"sync refill worker" and not the other way around.)

Sebastian
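[Editorial note] Sebastian's ordering concern can be made concrete with a tiny model. All names here are invented; the point is only that a teardown should cancel and sync the refill worker *before* freeing shared buffer state, which appears reversed in the otx2_stop() path above.

```c
/* Toy model of the teardown race: `worker_active` stands in for the
 * delayed refill work, `bufs` for the shared Rx buffer state. */
struct rxq {
	int worker_active;
	int bufs;
};

static void refill_worker(struct rxq *q)
{
	if (q->worker_active)
		q->bufs++;	/* refilling a queue that is being torn down */
}

/* otx2_stop()-like order: free first, sync the worker afterwards.
 * The worker can still run in between and repopulate freed state. */
static void stop_free_then_sync(struct rxq *q)
{
	q->bufs = 0;		/* free buffers */
	refill_worker(q);	/* worker races with the free */
	q->worker_active = 0;	/* cancel_delayed_work_sync() comes too late */
}

/* Safe order: sync the worker first, then free. */
static void stop_sync_then_free(struct rxq *q)
{
	q->worker_active = 0;	/* cancel + sync the worker first */
	refill_worker(q);	/* now a no-op */
	q->bufs = 0;		/* free buffers last */
}
```

In the model the unsafe order leaves a stray refilled buffer behind, which corresponds to the "worker fills the lockless buffer during free" scenario described above.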
> From: Jesper Dangaard Brouer <hawk@kernel.org>
> Subject: [EXT] Re: [BUG] Possible unsafe page_pool usage in octeontx2
>
> This WQ process is not allowed to use the page_pool_alloc() API this way
> (from a work-queue). The PP alloc-side API must only be used under NAPI
> protection. Thanks for spotting this Sebastian!
>
> Will/can any of the Cc'ed Marvell people work on a fix?

We are working on it.
diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index 7ff80b80a6f9f..b50e219470a36 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -612,7 +612,7 @@ __page_pool_put_page(struct page_pool *pool, struct page *page,
 		page_pool_dma_sync_for_device(pool, page, dma_sync_size);
 
-	if (allow_direct && in_softirq() &&
+	if (allow_direct && in_serving_softirq() &&
 	    page_pool_recycle_in_cache(page, pool))
 		return NULL;
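[Editorial note] For context on why this one-word change matters: in the kernel, in_softirq() is true whenever the softirq count is non-zero, which includes process context under local_bh_disable() (as in the otx2 refill worker), while in_serving_softirq() is true only while a softirq handler is actually executing. A rough userspace model of that distinction (the `_m` names are invented):

```c
/* Model of the preempt-count distinction: disabling BH and actually
 * serving a softirq both make in_softirq() true, but only the latter
 * counts as "serving". */
#define SOFTIRQ_DISABLE 0x1	/* e.g. local_bh_disable() in a worker */
#define SOFTIRQ_SERVING 0x2	/* executing a softirq handler (NAPI) */

static int in_softirq_m(int state)
{
	return state != 0;	/* any softirq activity, incl. BH-disable */
}

static int in_serving_softirq_m(int state)
{
	return state & SOFTIRQ_SERVING;	/* only genuine softirq context */
}
```

With in_softirq(), any BH-disabled context (like the refill worker path discussed in this thread) could pass the allow_direct check and recycle into the lockless cache concurrently with NAPI; in_serving_softirq() restricts direct recycling to genuine softirq context.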