[RFC,net-next/mm,V4,2/2] page_pool: Remove workqueue in new shutdown scheme

Message ID 168485357834.2849279.8073426325295894331.stgit@firesoul (mailing list archive)
State RFC
Delegated to: Netdev Maintainers
Series page_pool: new approach for leak detection and shutdown phase

Checks

Context Check Description
netdev/series_format success Posting correctly formatted
netdev/tree_selection success Clearly marked for net-next
netdev/fixes_present success Fixes tag not required for -next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 5179 this patch: 5179
netdev/cc_maintainers warning 2 maintainers not CCed: hawk@kernel.org edumazet@google.com
netdev/build_clang success Errors and warnings before: 1977 this patch: 1977
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 5410 this patch: 5410
netdev/checkpatch warning WARNING: Missing a blank line after declarations
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0

Commit Message

Jesper Dangaard Brouer May 23, 2023, 2:52 p.m. UTC
This removes the workqueue scheme that periodically checks whether the
number of inflight pages has reached zero, such that the page_pool
memory can be freed.

This change adds a check for a shutdown flag bit to the fast-path free
code, performed after PP pages have been returned.

Performance is very important for PP, as the fast path is used for
XDP_DROP use-cases where NIC drivers recycle PP pages directly into PP
alloc cache.

This patch (since V3) shows zero impact on this fast path, as
micro-benchmarked with [1] on an Intel CPU E5-1650 @ 3.60GHz. The
slight code reorganization around the likely() annotations is deliberate.

[1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/bench_page_pool_simple.c

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 include/net/page_pool.h |   10 ++-
 net/core/page_pool.c    |  138 ++++++++++++++++++++++++++++++++++-------------
 2 files changed, 104 insertions(+), 44 deletions(-)
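
A condensed sketch of the scheme described above, lifted from the tail of
page_pool_release_page() in the patch below (simplified, not the exact posted
hunks): the shutdown flag is only tested after a page has been returned, so
the XDP_DROP recycle fast path itself stays unchanged.

	release_cnt = atomic_inc_return(&pool->pages_state_release_cnt);
	trace_page_pool_state_release(pool, page, release_cnt);

	/* In shutdown phase, the CPU returning the last in-flight page
	 * attempts to free the pool instance itself.
	 */
	if (READ_ONCE(pool->p.flags) & PP_FLAG_SHUTDOWN)
		page_pool_free_attempt(pool, release_cnt);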

Comments

Toke Høiland-Jørgensen May 23, 2023, 4:16 p.m. UTC | #1
>  void page_pool_destroy(struct page_pool *pool)
>  {
> +	unsigned int flags;
> +	u32 release_cnt;
> +	u32 hold_cnt;
> +
>  	if (!pool)
>  		return;
>  
> @@ -868,11 +894,45 @@ void page_pool_destroy(struct page_pool *pool)
>  	if (!page_pool_release(pool))
>  		return;
>  
> -	pool->defer_start = jiffies;
> -	pool->defer_warn  = jiffies + DEFER_WARN_INTERVAL;
> +	/* PP have pages inflight, thus cannot immediately release memory.
> +	 * Enter into shutdown phase, depending on remaining in-flight PP
> +	 * pages to trigger shutdown process (on concurrent CPUs) and last
> +	 * page will free pool instance.
> +	 *
> +	 * There exist two race conditions here, we need to take into
> +	 * account in the following code.
> +	 *
> +	 * 1. Before setting PP_FLAG_SHUTDOWN another CPU released the last
> +	 *    pages into the ptr_ring.  Thus, it missed triggering shutdown
> +	 *    process, which can then be stalled forever.
> +	 *
> +	 * 2. After setting PP_FLAG_SHUTDOWN another CPU released the last
> +	 *    page, which triggered shutdown process and freed pool
> +	 *    instance. Thus, its not safe to dereference *pool afterwards.
> +	 *
> +	 * Handling races by holding a fake in-flight count, via artificially
> +	 * bumping pages_state_hold_cnt, which assures pool isn't freed under
> +	 * us.  Use RCU Grace-Periods to guarantee concurrent CPUs will
> +	 * transition safely into the shutdown phase.
> +	 *
> +	 * After safely transition into this state the races are resolved.  For
> +	 * race(1) its safe to recheck and empty ptr_ring (it will not free
> +	 * pool). Race(2) cannot happen, and we can release fake in-flight count
> +	 * as last step.
> +	 */
> +	hold_cnt = READ_ONCE(pool->pages_state_hold_cnt) + 1;
> +	WRITE_ONCE(pool->pages_state_hold_cnt, hold_cnt);
> +	synchronize_rcu();
> +
> +	flags = READ_ONCE(pool->p.flags) | PP_FLAG_SHUTDOWN;
> +	WRITE_ONCE(pool->p.flags, flags);
> +	synchronize_rcu();

Hmm, synchronize_rcu() can be quite expensive; why do we need two of
them? Should be fine to just do one after those two writes, as long as
the order of those writes is correct (which WRITE_ONCE should ensure)?
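
For illustration, a minimal sketch of the collapsed variant being suggested
here, with a single grace period after both writes (hypothetical, not the
posted patch):

	hold_cnt = READ_ONCE(pool->pages_state_hold_cnt) + 1;
	WRITE_ONCE(pool->pages_state_hold_cnt, hold_cnt); /* fake in-flight page */

	flags = READ_ONCE(pool->p.flags) | PP_FLAG_SHUTDOWN;
	WRITE_ONCE(pool->p.flags, flags);

	/* one grace period covering both writes */
	synchronize_rcu();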

Also, if we're adding this (blocking) operation in the teardown path we
risk adding latency to that path (network interface removal,
BPF_PROG_RUN syscall etc), so not sure if this actually ends up being an
improvement anymore, as opposed to just keeping the workqueue but
dropping the warning?

-Toke
Yunsheng Lin May 24, 2023, noon UTC | #2
On 2023/5/24 0:16, Toke Høiland-Jørgensen wrote:
>>  void page_pool_destroy(struct page_pool *pool)
>>  {
>> +	unsigned int flags;
>> +	u32 release_cnt;
>> +	u32 hold_cnt;
>> +
>>  	if (!pool)
>>  		return;
>>  
>> @@ -868,11 +894,45 @@ void page_pool_destroy(struct page_pool *pool)
>>  	if (!page_pool_release(pool))
>>  		return;
>>  
>> -	pool->defer_start = jiffies;
>> -	pool->defer_warn  = jiffies + DEFER_WARN_INTERVAL;
>> +	/* PP have pages inflight, thus cannot immediately release memory.
>> +	 * Enter into shutdown phase, depending on remaining in-flight PP
>> +	 * pages to trigger shutdown process (on concurrent CPUs) and last
>> +	 * page will free pool instance.
>> +	 *
>> +	 * There exist two race conditions here, we need to take into
>> +	 * account in the following code.
>> +	 *
>> +	 * 1. Before setting PP_FLAG_SHUTDOWN another CPU released the last
>> +	 *    pages into the ptr_ring.  Thus, it missed triggering shutdown
>> +	 *    process, which can then be stalled forever.
>> +	 *
>> +	 * 2. After setting PP_FLAG_SHUTDOWN another CPU released the last
>> +	 *    page, which triggered shutdown process and freed pool
>> +	 *    instance. Thus, its not safe to dereference *pool afterwards.
>> +	 *
>> +	 * Handling races by holding a fake in-flight count, via artificially
>> +	 * bumping pages_state_hold_cnt, which assures pool isn't freed under
>> +	 * us.  Use RCU Grace-Periods to guarantee concurrent CPUs will
>> +	 * transition safely into the shutdown phase.
>> +	 *
>> +	 * After safely transition into this state the races are resolved.  For
>> +	 * race(1) its safe to recheck and empty ptr_ring (it will not free
>> +	 * pool). Race(2) cannot happen, and we can release fake in-flight count
>> +	 * as last step.
>> +	 */
>> +	hold_cnt = READ_ONCE(pool->pages_state_hold_cnt) + 1;
>> +	WRITE_ONCE(pool->pages_state_hold_cnt, hold_cnt);
>> +	synchronize_rcu();
>> +
>> +	flags = READ_ONCE(pool->p.flags) | PP_FLAG_SHUTDOWN;
>> +	WRITE_ONCE(pool->p.flags, flags);
>> +	synchronize_rcu();
> 
> Hmm, synchronize_rcu() can be quite expensive; why do we need two of
> them? Should be fine to just do one after those two writes, as long as
> the order of those writes is correct (which WRITE_ONCE should ensure)?

I am not sure rcu is the right scheme to fix the problem, as rcu is usually
for one doing freeing/updating and many doing reading, while the case we
try to fix here is all doing the reading and trying to do the freeing.

And there might still be a data race here, as below:
     cpu0 calling page_pool_destroy()                cpu1 calling page_pool_release_page()

WRITE_ONCE(pool->pages_state_hold_cnt, hold_cnt);
      WRITE_ONCE(pool->p.flags, flags);
           synchronize_rcu();
                                                             atomic_inc_return()

        release_cnt = atomic_inc_return();
      page_pool_free_attempt(pool, release_cnt);
        rcu call page_pool_free_rcu()

				                     if (READ_ONCE(pool->p.flags) & PP_FLAG_SHUTDOWN)
                                                               page_pool_free_attempt()

As rcu_read_[un]lock() is only taken inside page_pool_free_attempt(),
cpu0 will see the inflight count as zero and trigger the RCU callback
to free the pp, while cpu1 sees pool->p.flags with PP_FLAG_SHUTDOWN set
and will access pool->pages_state_hold_cnt in __page_pool_inflight(),
causing a use-after-free problem?
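
Purely as an illustration (hypothetical, not code from this series), that
window could be closed by holding the RCU read lock across the whole tail of
page_pool_release_page(), so a call_rcu()-based free on another CPU cannot
run until this CPU stops touching *pool:

	rcu_read_lock();
	release_cnt = atomic_inc_return(&pool->pages_state_release_cnt);
	trace_page_pool_state_release(pool, page, release_cnt);

	/* In shutdown phase, last page will free pool instance */
	if (READ_ONCE(pool->p.flags) & PP_FLAG_SHUTDOWN)
		page_pool_free_attempt(pool, release_cnt);
	rcu_read_unlock();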


> 
> Also, if we're adding this (blocking) operation in the teardown path we
> risk adding latency to that path (network interface removal,
> BPF_PROG_RUN syscall etc), so not sure if this actually ends up being an
> improvement anymore, as opposed to just keeping the workqueue but
> dropping the warning?

We might be able to remove the workqueue from the destroy path, but a
workqueue might still be needed, triggered to call page_pool_free() in
non-atomic context instead of calling page_pool_free() directly from
page_pool_release_page(): page_pool_release_page() might be called in
atomic context, while page_pool_free() requires a non-atomic context
for put_device() and for pool->disconnect, which takes the mutex_lock()
in mem_allocator_disconnect().
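
A rough sketch of what that could look like, assuming a hypothetical
work_struct member added to struct page_pool (names are illustrative, not
part of the posted patch):

/* Deferred free: the actual page_pool_free() runs in process context. */
static void page_pool_free_work(struct work_struct *work)
{
	/* assumes a 'struct work_struct free_work' member in struct page_pool */
	struct page_pool *pool = container_of(work, struct page_pool, free_work);

	/* Safe to sleep here: put_device() and the mutex taken in
	 * mem_allocator_disconnect() need non-atomic context.
	 */
	page_pool_free(pool);
}

/* Called from the (possibly atomic) path returning the last in-flight page */
static void page_pool_schedule_free(struct page_pool *pool)
{
	INIT_WORK(&pool->free_work, page_pool_free_work);
	schedule_work(&pool->free_work);
}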
Jesper Dangaard Brouer May 24, 2023, 4:42 p.m. UTC | #3
On 24/05/2023 14.00, Yunsheng Lin wrote:
> On 2023/5/24 0:16, Toke Høiland-Jørgensen wrote:
>>>   void page_pool_destroy(struct page_pool *pool)
>>>   {
>>> +	unsigned int flags;
>>> +	u32 release_cnt;
>>> +	u32 hold_cnt;
>>> +
>>>   	if (!pool)
>>>   		return;
>>>   
>>> @@ -868,11 +894,45 @@ void page_pool_destroy(struct page_pool *pool)
>>>   	if (!page_pool_release(pool))
>>>   		return;
>>>   
>>> -	pool->defer_start = jiffies;
>>> -	pool->defer_warn  = jiffies + DEFER_WARN_INTERVAL;
>>> +	/* PP have pages inflight, thus cannot immediately release memory.
>>> +	 * Enter into shutdown phase, depending on remaining in-flight PP
>>> +	 * pages to trigger shutdown process (on concurrent CPUs) and last
>>> +	 * page will free pool instance.
>>> +	 *
>>> +	 * There exist two race conditions here, we need to take into
>>> +	 * account in the following code.
>>> +	 *
>>> +	 * 1. Before setting PP_FLAG_SHUTDOWN another CPU released the last
>>> +	 *    pages into the ptr_ring.  Thus, it missed triggering shutdown
>>> +	 *    process, which can then be stalled forever.
>>> +	 *
>>> +	 * 2. After setting PP_FLAG_SHUTDOWN another CPU released the last
>>> +	 *    page, which triggered shutdown process and freed pool
>>> +	 *    instance. Thus, its not safe to dereference *pool afterwards.
>>> +	 *
>>> +	 * Handling races by holding a fake in-flight count, via artificially
>>> +	 * bumping pages_state_hold_cnt, which assures pool isn't freed under
>>> +	 * us.  Use RCU Grace-Periods to guarantee concurrent CPUs will
>>> +	 * transition safely into the shutdown phase.
>>> +	 *
>>> +	 * After safely transition into this state the races are resolved.  For
>>> +	 * race(1) its safe to recheck and empty ptr_ring (it will not free
>>> +	 * pool). Race(2) cannot happen, and we can release fake in-flight count
>>> +	 * as last step.
>>> +	 */
>>> +	hold_cnt = READ_ONCE(pool->pages_state_hold_cnt) + 1;
>>> +	WRITE_ONCE(pool->pages_state_hold_cnt, hold_cnt);
>>> +	synchronize_rcu();
>>> +
>>> +	flags = READ_ONCE(pool->p.flags) | PP_FLAG_SHUTDOWN;
>>> +	WRITE_ONCE(pool->p.flags, flags);
>>> +	synchronize_rcu();
>>
>> Hmm, synchronize_rcu() can be quite expensive; why do we need two of
>> them? Should be fine to just do one after those two writes, as long as
>> the order of those writes is correct (which WRITE_ONCE should ensure)?
> 
> I am not sure rcu is the right scheme to fix the problem, as rcu is usually
> for one doing freeing/updating and many doing reading, while the case we
> try to fix here is all doing the reading and trying to do the freeing.
> 
> And there might still be a data race here, as below:
>       cpu0 calling page_pool_destroy()                cpu1 calling page_pool_release_page()
> 
> WRITE_ONCE(pool->pages_state_hold_cnt, hold_cnt);
>        WRITE_ONCE(pool->p.flags, flags);
>             synchronize_rcu();
>                                                               atomic_inc_return()
> 
>          release_cnt = atomic_inc_return();
>        page_pool_free_attempt(pool, release_cnt);
>          rcu call page_pool_free_rcu()
> 
> 				                     if (READ_ONCE(pool->p.flags) & PP_FLAG_SHUTDOWN)
>                                                                 page_pool_free_attempt()
> 
> As rcu_read_[un]lock() is only taken inside page_pool_free_attempt(),
> cpu0 will see the inflight count as zero and trigger the RCU callback
> to free the pp, while cpu1 sees pool->p.flags with PP_FLAG_SHUTDOWN set
> and will access pool->pages_state_hold_cnt in __page_pool_inflight(),
> causing a use-after-free problem?
> 
> 
>>
>> Also, if we're adding this (blocking) operation in the teardown path we
>> risk adding latency to that path (network interface removal,
>> BPF_PROG_RUN syscall etc), so not sure if this actually ends up being an
>> improvement anymore, as opposed to just keeping the workqueue but
>> dropping the warning?
> 
> We might be able to remove the workqueue from the destroy path, but a
> workqueue might still be needed, triggered to call page_pool_free() in
> non-atomic context instead of calling page_pool_free() directly from
> page_pool_release_page(): page_pool_release_page() might be called in
> atomic context, while page_pool_free() requires a non-atomic context
> for put_device() and for pool->disconnect, which takes the mutex_lock()
> in mem_allocator_disconnect().
> 

I thought the call_rcu() callback provided the right context, but
skimming call_rcu() I think it doesn't.  Argh, I think you are right, we
cannot avoid the workqueue, as we need the non-atomic context.

Thanks for catching and pointing this out :-)

--Jesper

Patch

diff --git a/include/net/page_pool.h b/include/net/page_pool.h
index c8ec2f34722b..19396e05f12d 100644
--- a/include/net/page_pool.h
+++ b/include/net/page_pool.h
@@ -50,6 +50,9 @@ 
 				 PP_FLAG_DMA_SYNC_DEV |\
 				 PP_FLAG_PAGE_FRAG)
 
+/* Internal flag: PP in shutdown phase, waiting for inflight pages */
+#define PP_FLAG_SHUTDOWN	BIT(8)
+
 /*
  * Fast allocation side cache array/stack
  *
@@ -151,11 +154,6 @@  static inline u64 *page_pool_ethtool_stats_get(u64 *data, void *stats)
 struct page_pool {
 	struct page_pool_params p;
 
-	struct delayed_work release_dw;
-	void (*disconnect)(void *);
-	unsigned long defer_start;
-	unsigned long defer_warn;
-
 	u32 pages_state_hold_cnt;
 	unsigned int frag_offset;
 	struct page *frag_page;
@@ -165,6 +163,7 @@  struct page_pool {
 	/* these stats are incremented while in softirq context */
 	struct page_pool_alloc_stats alloc_stats;
 #endif
+	void (*disconnect)(void *);
 	u32 xdp_mem_id;
 
 	/*
@@ -208,6 +207,7 @@  struct page_pool {
 	refcount_t user_cnt;
 
 	u64 destroy_cnt;
+	struct rcu_head rcu;
 };
 
 struct page *page_pool_alloc_pages(struct page_pool *pool, gfp_t gfp);
diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index e212e9d7edcb..213349148a48 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -23,9 +23,6 @@ 
 
 #include <trace/events/page_pool.h>
 
-#define DEFER_TIME (msecs_to_jiffies(1000))
-#define DEFER_WARN_INTERVAL (60 * HZ)
-
 #define BIAS_MAX	LONG_MAX
 
 #ifdef CONFIG_PAGE_POOL_STATS
@@ -380,6 +377,10 @@  static struct page *__page_pool_alloc_pages_slow(struct page_pool *pool,
 	struct page *page;
 	int i, nr_pages;
 
+	/* API usage BUG: PP in shutdown phase, cannot alloc new pages */
+	if (WARN_ON(pool->p.flags & PP_FLAG_SHUTDOWN))
+		return NULL;
+
 	/* Don't support bulk alloc for high-order pages */
 	if (unlikely(pp_order))
 		return __page_pool_alloc_page_order(pool, gfp);
@@ -450,9 +451,8 @@  EXPORT_SYMBOL(page_pool_alloc_pages);
  */
 #define _distance(a, b)	(s32)((a) - (b))
 
-static s32 page_pool_inflight(struct page_pool *pool)
+static s32 __page_pool_inflight(struct page_pool *pool, u32 release_cnt)
 {
-	u32 release_cnt = atomic_read(&pool->pages_state_release_cnt);
 	u32 hold_cnt = READ_ONCE(pool->pages_state_hold_cnt);
 	s32 inflight;
 
@@ -464,6 +464,14 @@  static s32 page_pool_inflight(struct page_pool *pool)
 	return inflight;
 }
 
+static s32 page_pool_inflight(struct page_pool *pool)
+{
+	u32 release_cnt = atomic_read(&pool->pages_state_release_cnt);
+	return __page_pool_inflight(pool, release_cnt);
+}
+
+static int page_pool_free_attempt(struct page_pool *pool, u32 release_cnt);
+
 /* Disconnects a page (from a page_pool).  API users can have a need
  * to disconnect a page (from a page_pool), to allow it to be used as
  * a regular page (that will eventually be returned to the normal
@@ -472,7 +480,7 @@  static s32 page_pool_inflight(struct page_pool *pool)
 void page_pool_release_page(struct page_pool *pool, struct page *page)
 {
 	dma_addr_t dma;
-	int count;
+	u32 release_cnt;
 
 	if (!(pool->p.flags & PP_FLAG_DMA_MAP))
 		/* Always account for inflight pages, even if we didn't
@@ -490,11 +498,12 @@  void page_pool_release_page(struct page_pool *pool, struct page *page)
 skip_dma_unmap:
 	page_pool_clear_pp_info(page);
 
-	/* This may be the last page returned, releasing the pool, so
-	 * it is not safe to reference pool afterwards.
-	 */
-	count = atomic_inc_return_relaxed(&pool->pages_state_release_cnt);
-	trace_page_pool_state_release(pool, page, count);
+	release_cnt = atomic_inc_return(&pool->pages_state_release_cnt);
+	trace_page_pool_state_release(pool, page, release_cnt);
+
+	/* In shutdown phase, last page will free pool instance */
+	if (READ_ONCE(pool->p.flags) & PP_FLAG_SHUTDOWN)
+		page_pool_free_attempt(pool, release_cnt);
 }
 EXPORT_SYMBOL(page_pool_release_page);
 
@@ -535,7 +544,7 @@  static bool page_pool_recycle_in_ring(struct page_pool *pool, struct page *page)
 static bool page_pool_recycle_in_cache(struct page *page,
 				       struct page_pool *pool)
 {
-	if (unlikely(pool->alloc.count == PP_ALLOC_CACHE_SIZE)) {
+	if (pool->alloc.count == PP_ALLOC_CACHE_SIZE) {
 		recycle_stat_inc(pool, cache_full);
 		return false;
 	}
@@ -546,6 +555,8 @@  static bool page_pool_recycle_in_cache(struct page *page,
 	return true;
 }
 
+static void page_pool_empty_ring(struct page_pool *pool);
+
 /* If the page refcnt == 1, this will try to recycle the page.
  * if PP_FLAG_DMA_SYNC_DEV is set, we'll try to sync the DMA area for
  * the configured size min(dma_sync_size, pool->max_len).
@@ -572,7 +583,8 @@  __page_pool_put_page(struct page_pool *pool, struct page *page,
 			page_pool_dma_sync_for_device(pool, page,
 						      dma_sync_size);
 
-		if (allow_direct && in_softirq() &&
+		/* During PP shutdown, no direct recycle must occur */
+		if (likely(allow_direct && in_softirq()) &&
 		    page_pool_recycle_in_cache(page, pool))
 			return NULL;
 
@@ -609,6 +621,8 @@  void page_pool_put_defragged_page(struct page_pool *pool, struct page *page,
 		recycle_stat_inc(pool, ring_full);
 		page_pool_return_page(pool, page);
 	}
+	if (page && pool->p.flags & PP_FLAG_SHUTDOWN)
+		page_pool_empty_ring(pool);
 }
 EXPORT_SYMBOL(page_pool_put_defragged_page);
 
@@ -646,6 +660,9 @@  void page_pool_put_page_bulk(struct page_pool *pool, void **data,
 	recycle_stat_add(pool, ring, i);
 	page_pool_ring_unlock(pool);
 
+	if (pool->p.flags & PP_FLAG_SHUTDOWN)
+		page_pool_empty_ring(pool);
+
 	/* Hopefully all pages was return into ptr_ring */
 	if (likely(i == bulk_len))
 		return;
@@ -737,12 +754,18 @@  struct page *page_pool_alloc_frag(struct page_pool *pool,
 }
 EXPORT_SYMBOL(page_pool_alloc_frag);
 
+noinline
 static void page_pool_empty_ring(struct page_pool *pool)
 {
-	struct page *page;
+	struct page *page, *next;
+
+	next = ptr_ring_consume_bh(&pool->ring);
 
 	/* Empty recycle ring */
-	while ((page = ptr_ring_consume_bh(&pool->ring))) {
+	while (next) {
+		page = next;
+		next = ptr_ring_consume_bh(&pool->ring);
+
 		/* Verify the refcnt invariant of cached pages */
 		if (!(page_ref_count(page) == 1))
 			pr_crit("%s() page_pool refcnt %d violation\n",
@@ -768,6 +791,14 @@  static void page_pool_free(struct page_pool *pool)
 	kfree(pool);
 }
 
+static void page_pool_free_rcu(struct rcu_head *rcu)
+{
+	struct page_pool *pool;
+
+	pool = container_of(rcu, struct page_pool, rcu);
+	page_pool_free(pool);
+}
+
 static void page_pool_empty_alloc_cache_once(struct page_pool *pool)
 {
 	struct page *page;
@@ -796,39 +827,30 @@  static void page_pool_scrub(struct page_pool *pool)
 	page_pool_empty_ring(pool);
 }
 
-static int page_pool_release(struct page_pool *pool)
+noinline
+static int page_pool_free_attempt(struct page_pool *pool, u32 release_cnt)
 {
 	int inflight;
 
-	page_pool_scrub(pool);
-	inflight = page_pool_inflight(pool);
+	rcu_read_lock();
+	inflight = __page_pool_inflight(pool, release_cnt);
+	rcu_read_unlock();
 	if (!inflight)
-		page_pool_free(pool);
+		call_rcu(&pool->rcu, page_pool_free_rcu);
 
 	return inflight;
 }
 
-static void page_pool_release_retry(struct work_struct *wq)
+static int page_pool_release(struct page_pool *pool)
 {
-	struct delayed_work *dwq = to_delayed_work(wq);
-	struct page_pool *pool = container_of(dwq, typeof(*pool), release_dw);
 	int inflight;
 
-	inflight = page_pool_release(pool);
+	page_pool_scrub(pool);
+	inflight = page_pool_inflight(pool);
 	if (!inflight)
-		return;
-
-	/* Periodic warning */
-	if (time_after_eq(jiffies, pool->defer_warn)) {
-		int sec = (s32)((u32)jiffies - (u32)pool->defer_start) / HZ;
-
-		pr_warn("%s() stalled pool shutdown %d inflight %d sec\n",
-			__func__, inflight, sec);
-		pool->defer_warn = jiffies + DEFER_WARN_INTERVAL;
-	}
+		page_pool_free(pool);
 
-	/* Still not ready to be disconnected, retry later */
-	schedule_delayed_work(&pool->release_dw, DEFER_TIME);
+	return inflight;
 }
 
 void page_pool_use_xdp_mem(struct page_pool *pool, void (*disconnect)(void *),
@@ -856,6 +878,10 @@  EXPORT_SYMBOL(page_pool_unlink_napi);
 
 void page_pool_destroy(struct page_pool *pool)
 {
+	unsigned int flags;
+	u32 release_cnt;
+	u32 hold_cnt;
+
 	if (!pool)
 		return;
 
@@ -868,11 +894,45 @@  void page_pool_destroy(struct page_pool *pool)
 	if (!page_pool_release(pool))
 		return;
 
-	pool->defer_start = jiffies;
-	pool->defer_warn  = jiffies + DEFER_WARN_INTERVAL;
+	/* PP have pages inflight, thus cannot immediately release memory.
+	 * Enter into shutdown phase, depending on remaining in-flight PP
+	 * pages to trigger shutdown process (on concurrent CPUs) and last
+	 * page will free pool instance.
+	 *
+	 * There exist two race conditions here, we need to take into
+	 * account in the following code.
+	 *
+	 * 1. Before setting PP_FLAG_SHUTDOWN another CPU released the last
+	 *    pages into the ptr_ring.  Thus, it missed triggering shutdown
+	 *    process, which can then be stalled forever.
+	 *
+	 * 2. After setting PP_FLAG_SHUTDOWN another CPU released the last
+	 *    page, which triggered shutdown process and freed pool
+	 *    instance. Thus, its not safe to dereference *pool afterwards.
+	 *
+	 * Handling races by holding a fake in-flight count, via artificially
+	 * bumping pages_state_hold_cnt, which assures pool isn't freed under
+	 * us.  Use RCU Grace-Periods to guarantee concurrent CPUs will
+	 * transition safely into the shutdown phase.
+	 *
+	 * After safely transition into this state the races are resolved.  For
+	 * race(1) its safe to recheck and empty ptr_ring (it will not free
+	 * pool). Race(2) cannot happen, and we can release fake in-flight count
+	 * as last step.
+	 */
+	hold_cnt = READ_ONCE(pool->pages_state_hold_cnt) + 1;
+	WRITE_ONCE(pool->pages_state_hold_cnt, hold_cnt);
+	synchronize_rcu();
+
+	flags = READ_ONCE(pool->p.flags) | PP_FLAG_SHUTDOWN;
+	WRITE_ONCE(pool->p.flags, flags);
+	synchronize_rcu();
+
+	/* Concurrent CPUs could have returned last pages into ptr_ring */
+	page_pool_empty_ring(pool);
 
-	INIT_DELAYED_WORK(&pool->release_dw, page_pool_release_retry);
-	schedule_delayed_work(&pool->release_dw, DEFER_TIME);
+	release_cnt = atomic_inc_return(&pool->pages_state_release_cnt);
+	page_pool_free_attempt(pool, release_cnt);
 }
 EXPORT_SYMBOL(page_pool_destroy);