| Message ID | 168485357834.2849279.8073426325295894331.stgit@firesoul |
|---|---|
| State | RFC |
| Delegated to: | Netdev Maintainers |
| Series | page_pool: new approach for leak detection and shutdown phase |
>  void page_pool_destroy(struct page_pool *pool)
>  {
> +        unsigned int flags;
> +        u32 release_cnt;
> +        u32 hold_cnt;
> +
>         if (!pool)
>                 return;
>
> @@ -868,11 +894,45 @@ void page_pool_destroy(struct page_pool *pool)
>         if (!page_pool_release(pool))
>                 return;
>
> -        pool->defer_start = jiffies;
> -        pool->defer_warn  = jiffies + DEFER_WARN_INTERVAL;
> +        /* PP have pages inflight, thus cannot immediately release memory.
> +         * Enter into shutdown phase, depending on remaining in-flight PP
> +         * pages to trigger shutdown process (on concurrent CPUs) and last
> +         * page will free pool instance.
> +         *
> +         * There exist two race conditions here, we need to take into
> +         * account in the following code.
> +         *
> +         * 1. Before setting PP_FLAG_SHUTDOWN another CPU released the last
> +         *    pages into the ptr_ring. Thus, it missed triggering shutdown
> +         *    process, which can then be stalled forever.
> +         *
> +         * 2. After setting PP_FLAG_SHUTDOWN another CPU released the last
> +         *    page, which triggered shutdown process and freed pool
> +         *    instance. Thus, its not safe to dereference *pool afterwards.
> +         *
> +         * Handling races by holding a fake in-flight count, via artificially
> +         * bumping pages_state_hold_cnt, which assures pool isn't freed under
> +         * us. Use RCU Grace-Periods to guarantee concurrent CPUs will
> +         * transition safely into the shutdown phase.
> +         *
> +         * After safely transition into this state the races are resolved. For
> +         * race(1) its safe to recheck and empty ptr_ring (it will not free
> +         * pool). Race(2) cannot happen, and we can release fake in-flight count
> +         * as last step.
> +         */
> +        hold_cnt = READ_ONCE(pool->pages_state_hold_cnt) + 1;
> +        WRITE_ONCE(pool->pages_state_hold_cnt, hold_cnt);
> +        synchronize_rcu();
> +
> +        flags = READ_ONCE(pool->p.flags) | PP_FLAG_SHUTDOWN;
> +        WRITE_ONCE(pool->p.flags, flags);
> +        synchronize_rcu();

Hmm, synchronize_rcu() can be quite expensive; why do we need two of
them? Should be fine to just do one after those two writes, as long as
the order of those writes is correct (which WRITE_ONCE should ensure)?

Also, if we're adding this (blocking) operation in the teardown path we
risk adding latency to that path (network interface removal,
BPF_PROG_RUN syscall etc), so not sure if this actually ends up being an
improvement anymore, as opposed to just keeping the workqueue but
dropping the warning?

-Toke
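For illustration, the variant Toke suggests here would collapse the transition
into a single grace period after both writes. This is only a sketch against the
field names used in the RFC patch (PP_FLAG_SHUTDOWN, pages_state_hold_cnt); the
helper name is made up for illustration and is not code from the series:

/* Sketch only: enter shutdown with one RCU grace period instead of two,
 * reusing the RFC patch's PP_FLAG_SHUTDOWN and pages_state_hold_cnt.
 */
static void page_pool_enter_shutdown(struct page_pool *pool)
{
        /* Fake in-flight page keeps the pool alive during the transition */
        WRITE_ONCE(pool->pages_state_hold_cnt,
                   READ_ONCE(pool->pages_state_hold_cnt) + 1);

        /* Mark shutdown phase; the page-return path tests this flag */
        WRITE_ONCE(pool->p.flags, READ_ONCE(pool->p.flags) | PP_FLAG_SHUTDOWN);

        /* Single grace period: any CPU that entered its RCU read-side
         * critical section before the writes above has left it by now.
         */
        synchronize_rcu();
}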
On 2023/5/24 0:16, Toke Høiland-Jørgensen wrote:
>>  void page_pool_destroy(struct page_pool *pool)
>>  {
>> +        unsigned int flags;
>> +        u32 release_cnt;
>> +        u32 hold_cnt;
>> +
>>         if (!pool)
>>                 return;
>>
>> @@ -868,11 +894,45 @@ void page_pool_destroy(struct page_pool *pool)
>>         if (!page_pool_release(pool))
>>                 return;
>>
>> -        pool->defer_start = jiffies;
>> -        pool->defer_warn  = jiffies + DEFER_WARN_INTERVAL;
>> +        /* PP have pages inflight, thus cannot immediately release memory.
>> +         * Enter into shutdown phase, depending on remaining in-flight PP
>> +         * pages to trigger shutdown process (on concurrent CPUs) and last
>> +         * page will free pool instance.
>> +         *
>> +         * There exist two race conditions here, we need to take into
>> +         * account in the following code.
>> +         *
>> +         * 1. Before setting PP_FLAG_SHUTDOWN another CPU released the last
>> +         *    pages into the ptr_ring. Thus, it missed triggering shutdown
>> +         *    process, which can then be stalled forever.
>> +         *
>> +         * 2. After setting PP_FLAG_SHUTDOWN another CPU released the last
>> +         *    page, which triggered shutdown process and freed pool
>> +         *    instance. Thus, its not safe to dereference *pool afterwards.
>> +         *
>> +         * Handling races by holding a fake in-flight count, via artificially
>> +         * bumping pages_state_hold_cnt, which assures pool isn't freed under
>> +         * us. Use RCU Grace-Periods to guarantee concurrent CPUs will
>> +         * transition safely into the shutdown phase.
>> +         *
>> +         * After safely transition into this state the races are resolved. For
>> +         * race(1) its safe to recheck and empty ptr_ring (it will not free
>> +         * pool). Race(2) cannot happen, and we can release fake in-flight count
>> +         * as last step.
>> +         */
>> +        hold_cnt = READ_ONCE(pool->pages_state_hold_cnt) + 1;
>> +        WRITE_ONCE(pool->pages_state_hold_cnt, hold_cnt);
>> +        synchronize_rcu();
>> +
>> +        flags = READ_ONCE(pool->p.flags) | PP_FLAG_SHUTDOWN;
>> +        WRITE_ONCE(pool->p.flags, flags);
>> +        synchronize_rcu();
>
> Hmm, synchronize_rcu() can be quite expensive; why do we need two of
> them? Should be fine to just do one after those two writes, as long as
> the order of those writes is correct (which WRITE_ONCE should ensure)?

I am not sure rcu is the right scheme to fix the problem, as rcu is usually
for one doing freeing/updating and many doing reading, while the case we
try to fix here is all doing the reading and trying to do the freeing.

And there might still be data race here as below:

cpu0 calling page_pool_destroy()                  cpu1 calling page_pool_release_page()

WRITE_ONCE(pool->pages_state_hold_cnt, hold_cnt);
WRITE_ONCE(pool->p.flags, flags);
synchronize_rcu();
                                                  atomic_inc_return()

release_cnt = atomic_inc_return();
page_pool_free_attempt(pool, release_cnt);
  rcu call page_pool_free_rcu()
                                                  if (READ_ONCE(pool->p.flags) & PP_FLAG_SHUTDOWN)
                                                          page_pool_free_attempt()

As the rcu_read_[un]lock are only in page_pool_free_attempt(), cpu0
will see the inflight being zero and trigger the rcu call to free the pp,
and cpu1 sees the pool->p.flags with PP_FLAG_SHUTDOWN set, so it will
access pool->pages_state_hold_cnt in __page_pool_inflight(), causing
a use-after-free problem?

>
> Also, if we're adding this (blocking) operation in the teardown path we
> risk adding latency to that path (network interface removal,
> BPF_PROG_RUN syscall etc), so not sure if this actually ends up being an
> improvement anymore, as opposed to just keeping the workqueue but
> dropping the warning?
We might be able to remove the workqueue from the destroy path, but a
workqueue might still be needed to trigger page_pool_free() in a
non-atomic context, instead of calling page_pool_free() directly from
page_pool_release_page(): page_pool_release_page() might be called in
atomic context, while page_pool_free() requires a non-atomic context
for put_device() and for pool->disconnect, which takes mutex_lock() in
mem_allocator_disconnect().
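For illustration, deferring the final free to a workqueue (so that put_device()
and the mutex taken in mem_allocator_disconnect() run in process context) could
look roughly like the sketch below. The release_work member and both helper
names are hypothetical, not part of the posted patch:

/* Sketch only: hand the final free off to a workqueue, so it can sleep. */
static void page_pool_release_workfn(struct work_struct *work)
{
        struct page_pool *pool = container_of(work, struct page_pool,
                                              release_work);

        page_pool_free(pool);   /* may sleep: put_device(), mutex_lock() */
}

/* Safe to call from the (possibly atomic) context returning the last page */
static void page_pool_schedule_free(struct page_pool *pool)
{
        INIT_WORK(&pool->release_work, page_pool_release_workfn);
        schedule_work(&pool->release_work);
}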
On 24/05/2023 14.00, Yunsheng Lin wrote:
> On 2023/5/24 0:16, Toke Høiland-Jørgensen wrote:
>>>  void page_pool_destroy(struct page_pool *pool)
>>>  {
>>> +        unsigned int flags;
>>> +        u32 release_cnt;
>>> +        u32 hold_cnt;
>>> +
>>>         if (!pool)
>>>                 return;
>>>
>>> @@ -868,11 +894,45 @@ void page_pool_destroy(struct page_pool *pool)
>>>         if (!page_pool_release(pool))
>>>                 return;
>>>
>>> -        pool->defer_start = jiffies;
>>> -        pool->defer_warn  = jiffies + DEFER_WARN_INTERVAL;
>>> +        /* PP have pages inflight, thus cannot immediately release memory.
>>> +         * Enter into shutdown phase, depending on remaining in-flight PP
>>> +         * pages to trigger shutdown process (on concurrent CPUs) and last
>>> +         * page will free pool instance.
>>> +         *
>>> +         * There exist two race conditions here, we need to take into
>>> +         * account in the following code.
>>> +         *
>>> +         * 1. Before setting PP_FLAG_SHUTDOWN another CPU released the last
>>> +         *    pages into the ptr_ring. Thus, it missed triggering shutdown
>>> +         *    process, which can then be stalled forever.
>>> +         *
>>> +         * 2. After setting PP_FLAG_SHUTDOWN another CPU released the last
>>> +         *    page, which triggered shutdown process and freed pool
>>> +         *    instance. Thus, its not safe to dereference *pool afterwards.
>>> +         *
>>> +         * Handling races by holding a fake in-flight count, via artificially
>>> +         * bumping pages_state_hold_cnt, which assures pool isn't freed under
>>> +         * us. Use RCU Grace-Periods to guarantee concurrent CPUs will
>>> +         * transition safely into the shutdown phase.
>>> +         *
>>> +         * After safely transition into this state the races are resolved. For
>>> +         * race(1) its safe to recheck and empty ptr_ring (it will not free
>>> +         * pool). Race(2) cannot happen, and we can release fake in-flight count
>>> +         * as last step.
>>> +         */
>>> +        hold_cnt = READ_ONCE(pool->pages_state_hold_cnt) + 1;
>>> +        WRITE_ONCE(pool->pages_state_hold_cnt, hold_cnt);
>>> +        synchronize_rcu();
>>> +
>>> +        flags = READ_ONCE(pool->p.flags) | PP_FLAG_SHUTDOWN;
>>> +        WRITE_ONCE(pool->p.flags, flags);
>>> +        synchronize_rcu();
>>
>> Hmm, synchronize_rcu() can be quite expensive; why do we need two of
>> them? Should be fine to just do one after those two writes, as long as
>> the order of those writes is correct (which WRITE_ONCE should ensure)?
>
> I am not sure rcu is the right scheme to fix the problem, as rcu is usually
> for one doing freeing/updating and many doing reading, while the case we
> try to fix here is all doing the reading and trying to do the freeing.
>
> And there might still be data race here as below:
>
> cpu0 calling page_pool_destroy()                  cpu1 calling page_pool_release_page()
>
> WRITE_ONCE(pool->pages_state_hold_cnt, hold_cnt);
> WRITE_ONCE(pool->p.flags, flags);
> synchronize_rcu();
>                                                   atomic_inc_return()
>
> release_cnt = atomic_inc_return();
> page_pool_free_attempt(pool, release_cnt);
>   rcu call page_pool_free_rcu()
>                                                   if (READ_ONCE(pool->p.flags) & PP_FLAG_SHUTDOWN)
>                                                           page_pool_free_attempt()
>
> As the rcu_read_[un]lock are only in page_pool_free_attempt(), cpu0
> will see the inflight being zero and trigger the rcu call to free the pp,
> and cpu1 sees the pool->p.flags with PP_FLAG_SHUTDOWN set, so it will
> access pool->pages_state_hold_cnt in __page_pool_inflight(), causing
> a use-after-free problem?
>
>
>>
>> Also, if we're adding this (blocking) operation in the teardown path we
>> risk adding latency to that path (network interface removal,
>> BPF_PROG_RUN syscall etc), so not sure if this actually ends up being an
>> improvement anymore, as opposed to just keeping the workqueue but
>> dropping the warning?
>
> We might be able to remove the workqueue from the destroy path, but a
> workqueue might still be needed to trigger page_pool_free() in a
> non-atomic context, instead of calling page_pool_free() directly from
> page_pool_release_page(): page_pool_release_page() might be called in
> atomic context, while page_pool_free() requires a non-atomic context
> for put_device() and for pool->disconnect, which takes mutex_lock() in
> mem_allocator_disconnect().
>

I thought the call_rcu() callback provided the right context, but
skimming call_rcu() I think it doesn't.

Argh, I think you are right, we cannot avoid the workqueue, as we need
the non-atomic context. Thanks for catching and pointing this out :-)

--Jesper
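A common way to keep both properties (a grace period before the free, plus a
sleepable context for the free itself) is to let the RCU callback only queue
work, as in this sketch; the release_work member is again hypothetical and the
helper name is made up. The kernel's queue_rcu_work()/struct rcu_work helpers
package essentially the same grace-period-then-workqueue pattern:

/* Sketch only: call_rcu() callbacks run in softirq context, so do not
 * call page_pool_free() here; just queue the sleepable work.
 */
static void page_pool_free_rcu_deferred(struct rcu_head *rcu)
{
        struct page_pool *pool = container_of(rcu, struct page_pool, rcu);

        schedule_work(&pool->release_work);     /* page_pool_free() runs there */
}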
diff --git a/include/net/page_pool.h b/include/net/page_pool.h
index c8ec2f34722b..19396e05f12d 100644
--- a/include/net/page_pool.h
+++ b/include/net/page_pool.h
@@ -50,6 +50,9 @@
                                  PP_FLAG_DMA_SYNC_DEV |\
                                  PP_FLAG_PAGE_FRAG)
 
+/* Internal flag: PP in shutdown phase, waiting for inflight pages */
+#define PP_FLAG_SHUTDOWN        BIT(8)
+
 /*
  * Fast allocation side cache array/stack
  *
@@ -151,11 +154,6 @@ static inline u64 *page_pool_ethtool_stats_get(u64 *data, void *stats)
 struct page_pool {
         struct page_pool_params p;
 
-        struct delayed_work release_dw;
-        void (*disconnect)(void *);
-        unsigned long defer_start;
-        unsigned long defer_warn;
-
         u32 pages_state_hold_cnt;
         unsigned int frag_offset;
         struct page *frag_page;
@@ -165,6 +163,7 @@ struct page_pool {
         /* these stats are incremented while in softirq context */
         struct page_pool_alloc_stats alloc_stats;
 #endif
+        void (*disconnect)(void *);
         u32 xdp_mem_id;
 
         /*
@@ -208,6 +207,7 @@ struct page_pool {
         refcount_t user_cnt;
 
         u64 destroy_cnt;
+        struct rcu_head rcu;
 };
 
 struct page *page_pool_alloc_pages(struct page_pool *pool, gfp_t gfp);
diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index e212e9d7edcb..213349148a48 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -23,9 +23,6 @@
 
 #include <trace/events/page_pool.h>
 
-#define DEFER_TIME (msecs_to_jiffies(1000))
-#define DEFER_WARN_INTERVAL (60 * HZ)
-
 #define BIAS_MAX        LONG_MAX
 
 #ifdef CONFIG_PAGE_POOL_STATS
@@ -380,6 +377,10 @@ static struct page *__page_pool_alloc_pages_slow(struct page_pool *pool,
         struct page *page;
         int i, nr_pages;
 
+        /* API usage BUG: PP in shutdown phase, cannot alloc new pages */
+        if (WARN_ON(pool->p.flags & PP_FLAG_SHUTDOWN))
+                return NULL;
+
         /* Don't support bulk alloc for high-order pages */
         if (unlikely(pp_order))
                 return __page_pool_alloc_page_order(pool, gfp);
@@ -450,9 +451,8 @@ EXPORT_SYMBOL(page_pool_alloc_pages);
  */
 #define _distance(a, b) (s32)((a) - (b))
 
-static s32 page_pool_inflight(struct page_pool *pool)
+static s32 __page_pool_inflight(struct page_pool *pool, u32 release_cnt)
 {
-        u32 release_cnt = atomic_read(&pool->pages_state_release_cnt);
         u32 hold_cnt = READ_ONCE(pool->pages_state_hold_cnt);
         s32 inflight;
 
@@ -464,6 +464,14 @@ static s32 page_pool_inflight(struct page_pool *pool)
         return inflight;
 }
 
+static s32 page_pool_inflight(struct page_pool *pool)
+{
+        u32 release_cnt = atomic_read(&pool->pages_state_release_cnt);
+        return __page_pool_inflight(pool, release_cnt);
+}
+
+static int page_pool_free_attempt(struct page_pool *pool, u32 release_cnt);
+
 /* Disconnects a page (from a page_pool).  API users can have a need
  * to disconnect a page (from a page_pool), to allow it to be used as
  * a regular page (that will eventually be returned to the normal
@@ -472,7 +480,7 @@ static s32 page_pool_inflight(struct page_pool *pool)
 void page_pool_release_page(struct page_pool *pool, struct page *page)
 {
         dma_addr_t dma;
-        int count;
+        u32 release_cnt;
 
         if (!(pool->p.flags & PP_FLAG_DMA_MAP))
                 /* Always account for inflight pages, even if we didn't
@@ -490,11 +498,12 @@ void page_pool_release_page(struct page_pool *pool, struct page *page)
 skip_dma_unmap:
         page_pool_clear_pp_info(page);
 
-        /* This may be the last page returned, releasing the pool, so
-         * it is not safe to reference pool afterwards.
-         */
-        count = atomic_inc_return_relaxed(&pool->pages_state_release_cnt);
-        trace_page_pool_state_release(pool, page, count);
+        release_cnt = atomic_inc_return(&pool->pages_state_release_cnt);
+        trace_page_pool_state_release(pool, page, release_cnt);
+
+        /* In shutdown phase, last page will free pool instance */
+        if (READ_ONCE(pool->p.flags) & PP_FLAG_SHUTDOWN)
+                page_pool_free_attempt(pool, release_cnt);
 }
 EXPORT_SYMBOL(page_pool_release_page);
 
@@ -535,7 +544,7 @@ static bool page_pool_recycle_in_ring(struct page_pool *pool, struct page *page)
 static bool page_pool_recycle_in_cache(struct page *page,
                                        struct page_pool *pool)
 {
-        if (unlikely(pool->alloc.count == PP_ALLOC_CACHE_SIZE)) {
+        if (pool->alloc.count == PP_ALLOC_CACHE_SIZE) {
                 recycle_stat_inc(pool, cache_full);
                 return false;
         }
@@ -546,6 +555,8 @@ static bool page_pool_recycle_in_cache(struct page *page,
         return true;
 }
 
+static void page_pool_empty_ring(struct page_pool *pool);
+
 /* If the page refcnt == 1, this will try to recycle the page.
  * if PP_FLAG_DMA_SYNC_DEV is set, we'll try to sync the DMA area for
  * the configured size min(dma_sync_size, pool->max_len).
@@ -572,7 +583,8 @@ __page_pool_put_page(struct page_pool *pool, struct page *page,
                         page_pool_dma_sync_for_device(pool, page,
                                                       dma_sync_size);
 
-                if (allow_direct && in_softirq() &&
+                /* During PP shutdown, no direct recycle must occur */
+                if (likely(allow_direct && in_softirq()) &&
                     page_pool_recycle_in_cache(page, pool))
                         return NULL;
 
@@ -609,6 +621,8 @@ void page_pool_put_defragged_page(struct page_pool *pool, struct page *page,
                 recycle_stat_inc(pool, ring_full);
                 page_pool_return_page(pool, page);
         }
+        if (page && pool->p.flags & PP_FLAG_SHUTDOWN)
+                page_pool_empty_ring(pool);
 }
 EXPORT_SYMBOL(page_pool_put_defragged_page);
 
@@ -646,6 +660,9 @@ void page_pool_put_page_bulk(struct page_pool *pool, void **data,
         recycle_stat_add(pool, ring, i);
         page_pool_ring_unlock(pool);
 
+        if (pool->p.flags & PP_FLAG_SHUTDOWN)
+                page_pool_empty_ring(pool);
+
         /* Hopefully all pages was return into ptr_ring */
         if (likely(i == bulk_len))
                 return;
@@ -737,12 +754,18 @@ struct page *page_pool_alloc_frag(struct page_pool *pool,
 }
 EXPORT_SYMBOL(page_pool_alloc_frag);
 
+noinline
 static void page_pool_empty_ring(struct page_pool *pool)
 {
-        struct page *page;
+        struct page *page, *next;
+
+        next = ptr_ring_consume_bh(&pool->ring);
 
         /* Empty recycle ring */
-        while ((page = ptr_ring_consume_bh(&pool->ring))) {
+        while (next) {
+                page = next;
+                next = ptr_ring_consume_bh(&pool->ring);
+
                 /* Verify the refcnt invariant of cached pages */
                 if (!(page_ref_count(page) == 1))
                         pr_crit("%s() page_pool refcnt %d violation\n",
@@ -768,6 +791,14 @@ static void page_pool_free(struct page_pool *pool)
         kfree(pool);
 }
 
+static void page_pool_free_rcu(struct rcu_head *rcu)
+{
+        struct page_pool *pool;
+
+        pool = container_of(rcu, struct page_pool, rcu);
+        page_pool_free(pool);
+}
+
 static void page_pool_empty_alloc_cache_once(struct page_pool *pool)
 {
         struct page *page;
@@ -796,39 +827,30 @@ static void page_pool_scrub(struct page_pool *pool)
         page_pool_empty_ring(pool);
 }
 
-static int page_pool_release(struct page_pool *pool)
+noinline
+static int page_pool_free_attempt(struct page_pool *pool, u32 release_cnt)
 {
         int inflight;
 
-        page_pool_scrub(pool);
-        inflight = page_pool_inflight(pool);
+        rcu_read_lock();
+        inflight = __page_pool_inflight(pool, release_cnt);
+        rcu_read_unlock();
         if (!inflight)
-                page_pool_free(pool);
+                call_rcu(&pool->rcu, page_pool_free_rcu);
 
         return inflight;
 }
 
-static void page_pool_release_retry(struct work_struct *wq)
+static int page_pool_release(struct page_pool *pool)
 {
-        struct delayed_work *dwq = to_delayed_work(wq);
-        struct page_pool *pool = container_of(dwq, typeof(*pool), release_dw);
         int inflight;
 
-        inflight = page_pool_release(pool);
+        page_pool_scrub(pool);
+        inflight = page_pool_inflight(pool);
         if (!inflight)
-                return;
-
-        /* Periodic warning */
-        if (time_after_eq(jiffies, pool->defer_warn)) {
-                int sec = (s32)((u32)jiffies - (u32)pool->defer_start) / HZ;
-
-                pr_warn("%s() stalled pool shutdown %d inflight %d sec\n",
-                        __func__, inflight, sec);
-                pool->defer_warn = jiffies + DEFER_WARN_INTERVAL;
-        }
+                page_pool_free(pool);
 
-        /* Still not ready to be disconnected, retry later */
-        schedule_delayed_work(&pool->release_dw, DEFER_TIME);
+        return inflight;
 }
 
 void page_pool_use_xdp_mem(struct page_pool *pool, void (*disconnect)(void *),
@@ -856,6 +878,10 @@ EXPORT_SYMBOL(page_pool_unlink_napi);
 
 void page_pool_destroy(struct page_pool *pool)
 {
+        unsigned int flags;
+        u32 release_cnt;
+        u32 hold_cnt;
+
         if (!pool)
                 return;
 
@@ -868,11 +894,45 @@ void page_pool_destroy(struct page_pool *pool)
         if (!page_pool_release(pool))
                 return;
 
-        pool->defer_start = jiffies;
-        pool->defer_warn  = jiffies + DEFER_WARN_INTERVAL;
+        /* PP have pages inflight, thus cannot immediately release memory.
+         * Enter into shutdown phase, depending on remaining in-flight PP
+         * pages to trigger shutdown process (on concurrent CPUs) and last
+         * page will free pool instance.
+         *
+         * There exist two race conditions here, we need to take into
+         * account in the following code.
+         *
+         * 1. Before setting PP_FLAG_SHUTDOWN another CPU released the last
+         *    pages into the ptr_ring. Thus, it missed triggering shutdown
+         *    process, which can then be stalled forever.
+         *
+         * 2. After setting PP_FLAG_SHUTDOWN another CPU released the last
+         *    page, which triggered shutdown process and freed pool
+         *    instance. Thus, its not safe to dereference *pool afterwards.
+         *
+         * Handling races by holding a fake in-flight count, via artificially
+         * bumping pages_state_hold_cnt, which assures pool isn't freed under
+         * us. Use RCU Grace-Periods to guarantee concurrent CPUs will
+         * transition safely into the shutdown phase.
+         *
+         * After safely transition into this state the races are resolved. For
+         * race(1) its safe to recheck and empty ptr_ring (it will not free
+         * pool). Race(2) cannot happen, and we can release fake in-flight count
+         * as last step.
+         */
+        hold_cnt = READ_ONCE(pool->pages_state_hold_cnt) + 1;
+        WRITE_ONCE(pool->pages_state_hold_cnt, hold_cnt);
+        synchronize_rcu();
+
+        flags = READ_ONCE(pool->p.flags) | PP_FLAG_SHUTDOWN;
+        WRITE_ONCE(pool->p.flags, flags);
+        synchronize_rcu();
+
+        /* Concurrent CPUs could have returned last pages into ptr_ring */
+        page_pool_empty_ring(pool);
 
-        INIT_DELAYED_WORK(&pool->release_dw, page_pool_release_retry);
-        schedule_delayed_work(&pool->release_dw, DEFER_TIME);
+        release_cnt = atomic_inc_return(&pool->pages_state_release_cnt);
+        page_pool_free_attempt(pool, release_cnt);
 }
 EXPORT_SYMBOL(page_pool_destroy);
This removes the workqueue scheme that periodically tests whether the
inflight page count has reached zero, so that page_pool memory can be
freed. Instead, this change adds a check for a shutdown flag bit in the
fast-path free code, after PP pages are returned.

Performance is very important for PP, as the fast path is used for
XDP_DROP use-cases where NIC drivers recycle PP pages directly into the
PP alloc cache. This patch (since V3) shows zero impact on this fast
path. Micro-benchmarked with [1] on an Intel CPU E5-1650 @3.60GHz. The
slight code reorganization around likely() is deliberate.

[1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/bench_page_pool_simple.c

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 include/net/page_pool.h |   10 ++-
 net/core/page_pool.c    |  138 ++++++++++++++++++++++++++++++++++-------------
 2 files changed, 104 insertions(+), 44 deletions(-)
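Side note on the accounting the shutdown check relies on: _distance() computes
a wraparound-safe difference between the u32 hold and release counters, which
is why the extra "fake" hold taken in page_pool_destroy() keeps the inflight
result positive until it is released. A small standalone illustration (plain
userspace C, not the kernel code paths themselves):

/* Demo of the wraparound-safe in-flight arithmetic used by _distance(). */
#include <stdint.h>
#include <stdio.h>

#define _distance(a, b) ((int32_t)((a) - (b)))

int main(void)
{
        uint32_t hold_cnt = 5, release_cnt = 3;

        /* Two pages still in flight */
        printf("inflight=%d\n", _distance(hold_cnt, release_cnt));

        /* Counters may wrap past UINT32_MAX; the signed cast of the
         * unsigned difference still yields the small positive distance.
         */
        hold_cnt = 2;               /* already wrapped around */
        release_cnt = 0xfffffffe;   /* not yet wrapped */
        printf("inflight=%d\n", _distance(hold_cnt, release_cnt)); /* 4 */
        return 0;
}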