[RFC,net] net: make page pool stall netdev unregistration to avoid IOMMU crashes

Message ID 20240806151618.1373008-1-kuba@kernel.org (mailing list archive)
State RFC
Delegated to: Netdev Maintainers
Series [RFC,net] net: make page pool stall netdev unregistration to avoid IOMMU crashes

Checks

Context Check Description
netdev/series_format success Single patches do not need cover letters
netdev/tree_selection success Clearly marked for net, async
netdev/ynl success Generated files up to date; no warnings/errors; no diff in generated;
netdev/fixes_present fail Series targets non-next tree, but doesn't contain any Fixes tags
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 68 this patch: 68
netdev/build_tools success Errors and warnings before: 9 this patch: 9
netdev/cc_maintainers success CCed 5 of 5 maintainers
netdev/build_clang success Errors and warnings before: 115 this patch: 115
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 4090 this patch: 4090
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 97 lines checked
netdev/build_clang_rust success No Rust files in patch. Skipping build
netdev/kdoc success Errors and warnings before: 22 this patch: 22
netdev/source_inline success Was 0 now: 0

Commit Message

Jakub Kicinski Aug. 6, 2024, 3:16 p.m. UTC
There appears to be no clean way to hold onto the IOMMU, so page pool
cannot outlast the driver which created it. We have no way to stall
the driver unregister, but we can use netdev unregistration as a proxy.

Note that page pool pages may last forever; we have seen it happen
e.g. when an application leaks a socket and a page is stuck in its rcv queue.
Hopefully this is fine in this particular case, as we will only stall
unregistering of devices which want the page pool to manage the DMA
mapping for them, i.e. HW backed netdevs. And obviously keeping
the netdev around is preferable to a crash.

More work is needed for weird drivers which share one pool among
multiple netdevs, as they are not allowed to set the pp->netdev
pointer. We probably need to add a bit that says "don't expose
to uAPI for them".

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
---
Untested, but I think it would work.. if it's not too controversial.

CC: Jesper Dangaard Brouer <hawk@kernel.org>
CC: Alexander Duyck <alexander.duyck@gmail.com>
CC: Yonglong Liu <liuyonglong@huawei.com>
CC: Yunsheng Lin <linyunsheng@huawei.com>
---
 include/linux/netdevice.h |  4 ++++
 net/core/page_pool_user.c | 44 +++++++++++++++++++++++++++++++--------
 2 files changed, 39 insertions(+), 9 deletions(-)

Comments

Yonglong Liu Aug. 7, 2024, 7:03 a.m. UTC | #1
I tested this patch and got the same crash as I reported...

On 2024/8/6 23:16, Jakub Kicinski wrote:
> There appears to be no clean way to hold onto the IOMMU, so page pool
> cannot outlast the driver which created it. We have no way to stall
> the driver unregister, but we can use netdev unregistration as a proxy.
>
> Note that page pool pages may last forever, we have seen it happen
> e.g. when application leaks a socket and page is stuck in its rcv queue.
> Hopefully this is fine in this particular case, as we will only stall
> unregistering of devices which want the page pool to manage the DMA
> mapping for them, i.e. HW backed netdevs. And obviously keeping
> the netdev around is preferable to a crash.
>
> More work is needed for weird drivers which share one pool among
> multiple netdevs, as they are not allowed to set the pp->netdev
> pointer. We probably need to add a bit that says "don't expose
> to uAPI for them".
>
> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
> ---
> Untested, but I think it would work.. if it's not too controversial.
>
> CC: Jesper Dangaard Brouer <hawk@kernel.org>
> CC: Alexander Duyck <alexander.duyck@gmail.com>
> CC: Yonglong Liu <liuyonglong@huawei.com>
> CC: Yunsheng Lin <linyunsheng@huawei.com>
> ---
>   include/linux/netdevice.h |  4 ++++
>   net/core/page_pool_user.c | 44 +++++++++++++++++++++++++++++++--------
>   2 files changed, 39 insertions(+), 9 deletions(-)
>
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index 0ef3eaa23f4b..c817bde7bacc 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -2342,6 +2342,8 @@ struct net_device {
>   	struct lock_class_key	*qdisc_tx_busylock;
>   	bool			proto_down;
>   	bool			threaded;
> +	/** @pp_unreg_pending: page pool code is stalling unregister */
> +	bool			pp_unreg_pending;
>   
>   	struct list_head	net_notifier_list;
>   
> @@ -2371,6 +2373,8 @@ struct net_device {
>   #if IS_ENABLED(CONFIG_PAGE_POOL)
>   	/** @page_pools: page pools created for this netdevice */
>   	struct hlist_head	page_pools;
> +	/** @pp_dev_tracker: ref tracker for page pool code stalling unreg */
> +	netdevice_tracker	pp_dev_tracker;
>   #endif
>   
>   	/** @irq_moder: dim parameters used if IS_ENABLED(CONFIG_DIMLIB). */
> diff --git a/net/core/page_pool_user.c b/net/core/page_pool_user.c
> index 3a3277ba167b..1a4135f01130 100644
> --- a/net/core/page_pool_user.c
> +++ b/net/core/page_pool_user.c
> @@ -349,22 +349,36 @@ static void page_pool_unreg_netdev_wipe(struct net_device *netdev)
>   	struct page_pool *pool;
>   	struct hlist_node *n;
>   
> -	mutex_lock(&page_pools_lock);
>   	hlist_for_each_entry_safe(pool, n, &netdev->page_pools, user.list) {
>   		hlist_del_init(&pool->user.list);
>   		pool->slow.netdev = NET_PTR_POISON;
>   	}
> -	mutex_unlock(&page_pools_lock);
>   }
>   
> -static void page_pool_unreg_netdev(struct net_device *netdev)
> +static void page_pool_unreg_netdev_stall(struct net_device *netdev)
> +{
> +	if (!netdev->pp_unreg_pending) {
> +		netdev_hold(netdev, &netdev->pp_dev_tracker, GFP_KERNEL);
> +		netdev->pp_unreg_pending = true;
> +	} else {
> +		netdev_warn(netdev,
> +			    "page pool release stalling device unregister");
> +	}
> +}
> +
> +static void page_pool_unreg_netdev_unstall(struct net_device *netdev)
> +{
> +	netdev_put(netdev, &netdev->pp_dev_tracker);
> +	netdev->pp_unreg_pending = false;
> +}
> +
> +static void page_pool_unreg_netdev_reparent(struct net_device *netdev)
>   {
>   	struct page_pool *pool, *last;
>   	struct net_device *lo;
>   
>   	lo = dev_net(netdev)->loopback_dev;
>   
> -	mutex_lock(&page_pools_lock);
>   	last = NULL;
>   	hlist_for_each_entry(pool, &netdev->page_pools, user.list) {
>   		pool->slow.netdev = lo;
> @@ -375,7 +389,6 @@ static void page_pool_unreg_netdev(struct net_device *netdev)
>   	if (last)
>   		hlist_splice_init(&netdev->page_pools, &last->user.list,
>   				  &lo->page_pools);
> -	mutex_unlock(&page_pools_lock);
>   }
>   
>   static int
> @@ -383,17 +396,30 @@ page_pool_netdevice_event(struct notifier_block *nb,
>   			  unsigned long event, void *ptr)
>   {
>   	struct net_device *netdev = netdev_notifier_info_to_dev(ptr);
> +	struct page_pool *pool;
> +	bool has_dma;
>   
>   	if (event != NETDEV_UNREGISTER)
>   		return NOTIFY_DONE;
>   
> -	if (hlist_empty(&netdev->page_pools))
> +	if (hlist_empty(&netdev->page_pools) && !netdev->pp_unreg_pending)
>   		return NOTIFY_OK;
>   
> -	if (netdev->ifindex != LOOPBACK_IFINDEX)
> -		page_pool_unreg_netdev(netdev);
> -	else
> +	mutex_lock(&page_pools_lock);
> +	has_dma = false;
> +	hlist_for_each_entry(pool, &netdev->page_pools, user.list)
> +		has_dma |= pool->slow.flags & PP_FLAG_DMA_MAP;
> +
> +	if (has_dma)
> +		page_pool_unreg_netdev_stall(netdev);
> +	else if (netdev->pp_unreg_pending)
> +		page_pool_unreg_netdev_unstall(netdev);
> +	else if (netdev->ifindex == LOOPBACK_IFINDEX)
>   		page_pool_unreg_netdev_wipe(netdev);
> +	else /* driver doesn't let page pools manage DMA addrs */
> +		page_pool_unreg_netdev_reparent(netdev);
> +	mutex_unlock(&page_pools_lock);
> +
>   	return NOTIFY_OK;
>   }
>
Yunsheng Lin Aug. 7, 2024, 11 a.m. UTC | #2
On 2024/8/6 23:16, Jakub Kicinski wrote:
> There appears to be no clean way to hold onto the IOMMU, so page pool
> cannot outlast the driver which created it. We have no way to stall
> the driver unregister, but we can use netdev unregistration as a proxy.
> 
> Note that page pool pages may last forever, we have seen it happen
> e.g. when application leaks a socket and page is stuck in its rcv queue.

We have seen some page_pool pages seemingly last forever too, but were not sure
whether it was for the same reason as above. Are there any commands/ways to debug
whether an application has leaked a socket and a page is stuck in its rcv queue?

> Hopefully this is fine in this particular case, as we will only stall
> unregistering of devices which want the page pool to manage the DMA
> mapping for them, i.e. HW backed netdevs. And obviously keeping
> the netdev around is preferable to a crash.
> 
> More work is needed for weird drivers which share one pool among
> multiple netdevs, as they are not allowed to set the pp->netdev
> pointer. We probably need to add a bit that says "don't expose
> to uAPI for them".
>
Jakub Kicinski Aug. 7, 2024, 2:27 p.m. UTC | #3
On Wed, 7 Aug 2024 15:03:06 +0800 Yonglong Liu wrote:
> I tested this patch, have the same crash as I reported...

Which driver? This is only going to work if the driver hooks the netdev
up to the page pool, which is still rare for drivers lacking community
involvement.
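
For reference, "hooking the netdev to the page pool" means the driver filling in
the .netdev member of struct page_pool_params when it creates its RX pools. A
minimal, hedged sketch of what that could look like -- the function name and the
sizes here are made up, not taken from hns3 or any other driver:

struct page_pool *example_create_rx_pool(struct net_device *netdev,
					 struct device *dma_dev,
					 struct napi_struct *napi)
{
	struct page_pool_params pp_params = {
		/* let the pool own the DMA mapping -- the case this RFC stalls for */
		.flags		= PP_FLAG_DMA_MAP | PP_FLAG_DMA_SYNC_DEV,
		.order		= 0,
		.pool_size	= 1024,
		.nid		= NUMA_NO_NODE,
		.dev		= dma_dev,	/* device used for DMA mapping */
		.napi		= napi,
		.netdev		= netdev,	/* the "hook" being discussed */
		.dma_dir	= DMA_FROM_DEVICE,
		.max_len	= PAGE_SIZE,
	};

	return page_pool_create(&pp_params);
}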
Jakub Kicinski Aug. 7, 2024, 2:29 p.m. UTC | #4
On Wed, 7 Aug 2024 19:00:35 +0800 Yunsheng Lin wrote:
> > Note that page pool pages may last forever, we have seen it happen
> > e.g. when application leaks a socket and page is stuck in its rcv queue.  
> 
> We saw some page_pool pages might last forever too, but were not sure
> if it was the same reason as above? Are there some cmds/ways to debug
> if a application leaks a socket and page is stuck in its rcv queue?

I used drgn to scan all sockets to find the page.

> > Hopefully this is fine in this particular case, as we will only stall
> > unregistering of devices which want the page pool to manage the DMA
> > mapping for them, i.e. HW backed netdevs. And obviously keeping
> > the netdev around is preferable to a crash.
Yonglong Liu Aug. 8, 2024, 1:11 a.m. UTC | #5
On 2024/8/7 22:27, Jakub Kicinski wrote:
> On Wed, 7 Aug 2024 15:03:06 +0800 Yonglong Liu wrote:
>> I tested this patch, have the same crash as I reported...
> Which driver, this is only gonna work if the driver hooks the netdev
> to the page pool. Which is still rare for drivers lacking community
> involvement.
The hns3 driver; we didn't hook the netdev up to the page pool. Will try this
later, thanks :)
Ilias Apalodimas Aug. 8, 2024, 11:12 a.m. UTC | #6
Hi Jakub,

On Tue, 6 Aug 2024 at 18:16, Jakub Kicinski <kuba@kernel.org> wrote:
>
> There appears to be no clean way to hold onto the IOMMU, so page pool
> cannot outlast the driver which created it. We have no way to stall
> the driver unregister, but we can use netdev unregistration as a proxy.

Isn't the inflight machinery enough?
Looking at the page_pool_destroy() path, we eventually call
page_pool_return_page() which will try to unmap memory. That won't be
called if we have inflight packets.
In any case, why do you want to hold on to the IOMMU? The network
interface -- at least in theory -- should be down and we won't be
processing any more packets.

Regards
/Ilias
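
The "inflight machinery" referred to above is, roughly, the hold/release
counters the pool keeps so that page_pool_destroy() can defer the real teardown;
a simplified sketch, not verbatim kernel source:

/* Every page handed out bumps the hold counter, every page returned bumps
 * the release counter; the pool is only really freed once the difference
 * reaches zero, which is what the page_pool_release_retry() worker keeps
 * re-checking (and warning about) until then.
 */
static s32 sketch_page_pool_inflight(const struct page_pool *pool)
{
	u32 release_cnt = atomic_read(&pool->pages_state_release_cnt);

	return pool->pages_state_hold_cnt - release_cnt;
}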
>
> Note that page pool pages may last forever, we have seen it happen
> e.g. when application leaks a socket and page is stuck in its rcv queue.
> Hopefully this is fine in this particular case, as we will only stall
> unregistering of devices which want the page pool to manage the DMA
> mapping for them, i.e. HW backed netdevs. And obviously keeping
> the netdev around is preferable to a crash.
>
> More work is needed for weird drivers which share one pool among
> multiple netdevs, as they are not allowed to set the pp->netdev
> pointer. We probably need to add a bit that says "don't expose
> to uAPI for them".
>
> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
> ---
> Untested, but I think it would work.. if it's not too controversial.
>
> CC: Jesper Dangaard Brouer <hawk@kernel.org>
> CC: Alexander Duyck <alexander.duyck@gmail.com>
> CC: Yonglong Liu <liuyonglong@huawei.com>
> CC: Yunsheng Lin <linyunsheng@huawei.com>
> ---
>  include/linux/netdevice.h |  4 ++++
>  net/core/page_pool_user.c | 44 +++++++++++++++++++++++++++++++--------
>  2 files changed, 39 insertions(+), 9 deletions(-)
>
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index 0ef3eaa23f4b..c817bde7bacc 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -2342,6 +2342,8 @@ struct net_device {
>         struct lock_class_key   *qdisc_tx_busylock;
>         bool                    proto_down;
>         bool                    threaded;
> +       /** @pp_unreg_pending: page pool code is stalling unregister */
> +       bool                    pp_unreg_pending;
>
>         struct list_head        net_notifier_list;
>
> @@ -2371,6 +2373,8 @@ struct net_device {
>  #if IS_ENABLED(CONFIG_PAGE_POOL)
>         /** @page_pools: page pools created for this netdevice */
>         struct hlist_head       page_pools;
> +       /** @pp_dev_tracker: ref tracker for page pool code stalling unreg */
> +       netdevice_tracker       pp_dev_tracker;
>  #endif
>
>         /** @irq_moder: dim parameters used if IS_ENABLED(CONFIG_DIMLIB). */
> diff --git a/net/core/page_pool_user.c b/net/core/page_pool_user.c
> index 3a3277ba167b..1a4135f01130 100644
> --- a/net/core/page_pool_user.c
> +++ b/net/core/page_pool_user.c
> @@ -349,22 +349,36 @@ static void page_pool_unreg_netdev_wipe(struct net_device *netdev)
>         struct page_pool *pool;
>         struct hlist_node *n;
>
> -       mutex_lock(&page_pools_lock);
>         hlist_for_each_entry_safe(pool, n, &netdev->page_pools, user.list) {
>                 hlist_del_init(&pool->user.list);
>                 pool->slow.netdev = NET_PTR_POISON;
>         }
> -       mutex_unlock(&page_pools_lock);
>  }
>
> -static void page_pool_unreg_netdev(struct net_device *netdev)
> +static void page_pool_unreg_netdev_stall(struct net_device *netdev)
> +{
> +       if (!netdev->pp_unreg_pending) {
> +               netdev_hold(netdev, &netdev->pp_dev_tracker, GFP_KERNEL);
> +               netdev->pp_unreg_pending = true;
> +       } else {
> +               netdev_warn(netdev,
> +                           "page pool release stalling device unregister");
> +       }
> +}
> +
> +static void page_pool_unreg_netdev_unstall(struct net_device *netdev)
> +{
> +       netdev_put(netdev, &netdev->pp_dev_tracker);
> +       netdev->pp_unreg_pending = false;
> +}
> +
> +static void page_pool_unreg_netdev_reparent(struct net_device *netdev)
>  {
>         struct page_pool *pool, *last;
>         struct net_device *lo;
>
>         lo = dev_net(netdev)->loopback_dev;
>
> -       mutex_lock(&page_pools_lock);
>         last = NULL;
>         hlist_for_each_entry(pool, &netdev->page_pools, user.list) {
>                 pool->slow.netdev = lo;
> @@ -375,7 +389,6 @@ static void page_pool_unreg_netdev(struct net_device *netdev)
>         if (last)
>                 hlist_splice_init(&netdev->page_pools, &last->user.list,
>                                   &lo->page_pools);
> -       mutex_unlock(&page_pools_lock);
>  }
>
>  static int
> @@ -383,17 +396,30 @@ page_pool_netdevice_event(struct notifier_block *nb,
>                           unsigned long event, void *ptr)
>  {
>         struct net_device *netdev = netdev_notifier_info_to_dev(ptr);
> +       struct page_pool *pool;
> +       bool has_dma;
>
>         if (event != NETDEV_UNREGISTER)
>                 return NOTIFY_DONE;
>
> -       if (hlist_empty(&netdev->page_pools))
> +       if (hlist_empty(&netdev->page_pools) && !netdev->pp_unreg_pending)
>                 return NOTIFY_OK;
>
> -       if (netdev->ifindex != LOOPBACK_IFINDEX)
> -               page_pool_unreg_netdev(netdev);
> -       else
> +       mutex_lock(&page_pools_lock);
> +       has_dma = false;
> +       hlist_for_each_entry(pool, &netdev->page_pools, user.list)
> +               has_dma |= pool->slow.flags & PP_FLAG_DMA_MAP;
> +
> +       if (has_dma)
> +               page_pool_unreg_netdev_stall(netdev);
> +       else if (netdev->pp_unreg_pending)
> +               page_pool_unreg_netdev_unstall(netdev);
> +       else if (netdev->ifindex == LOOPBACK_IFINDEX)
>                 page_pool_unreg_netdev_wipe(netdev);
> +       else /* driver doesn't let page pools manage DMA addrs */
> +               page_pool_unreg_netdev_reparent(netdev);
> +       mutex_unlock(&page_pools_lock);
> +
>         return NOTIFY_OK;
>  }
>
> --
> 2.45.2
>
Yonglong Liu Aug. 8, 2024, 12:52 p.m. UTC | #7
On 2024/8/7 22:29, Jakub Kicinski wrote:
> On Wed, 7 Aug 2024 19:00:35 +0800 Yunsheng Lin wrote:
>>> Note that page pool pages may last forever, we have seen it happen
>>> e.g. when application leaks a socket and page is stuck in its rcv queue.
>> We saw some page_pool pages might last forever too, but were not sure
>> if it was the same reason as above? Are there some cmds/ways to debug
>> if a application leaks a socket and page is stuck in its rcv queue?
> I used drgn to scan all sockets to find the page.

I hooked the netdev up to the page pool and ran with this patch for a
while, then got the following messages, and the VF could not be disabled:

[ 1950.137586] hns3 0000:7d:01.0 eno1v0: link up
[ 1950.137671] hns3 0000:7d:01.0 eno1v0: net open
[ 1950.147098] 8021q: adding VLAN 0 to HW filter on device eno1v0
[ 1974.287476] hns3 0000:7d:01.0 eno1v0: net stop
[ 1974.294359] hns3 0000:7d:01.0 eno1v0: link down
[ 1975.596916] hns3 0000:7d:01.0 eno1v0 (unregistered): page pool release stalling device unregister
[ 1976.744947] hns3 0000:7d:01.0 eno1v0 (unregistered): page pool release stalling device unregister
[ 1977.900916] hns3 0000:7d:01.0 eno1v0 (unregistered): page pool release stalling device unregister
[ 1979.080929] hns3 0000:7d:01.0 eno1v0 (unregistered): page pool release stalling device unregister
[ 1980.236914] hns3 0000:7d:01.0 eno1v0 (unregistered): page pool release stalling device unregister
[ 1981.384913] hns3 0000:7d:01.0 eno1v0 (unregistered): page pool release stalling device unregister
[ 1982.568918] hns3 0000:7d:01.0 eno1v0 (unregistered): page pool release stalling device unregister
[ 1983.720912] hns3 0000:7d:01.0 eno1v0 (unregistered): page pool release stalling device unregister
[ 1984.584941] unregister_netdevice: waiting for eno1v0 to become free. Usage count = 2
[ 1984.872930] hns3 0000:7d:01.0 eno1v0 (unregistered): page pool release stalling device unregister
[ 1986.024924] hns3 0000:7d:01.0 eno1v0 (unregistered): page pool release stalling device unregister
[ 1987.176927] hns3 0000:7d:01.0 eno1v0 (unregistered): page pool release stalling device unregister
[ 1988.328922] hns3 0000:7d:01.0 eno1v0 (unregistered): page pool release stalling device unregister
[ 1989.480917] hns3 0000:7d:01.0 eno1v0 (unregistered): page pool release stalling device unregister
[ 1990.632913] hns3 0000:7d:01.0 eno1v0 (unregistered): page pool release stalling device unregister
[ 1991.784915] hns3 0000:7d:01.0 eno1v0 (unregistered): page pool release stalling device unregister

...

[ 8640.008931] unregister_netdevice: waiting for eno1v0 to become free. Usage count = 2
[ 8640.300912] hns3 0000:7d:01.0 eno1v0 (unregistered): page pool release stalling device unregister
[ 8641.452910] hns3 0000:7d:01.0 eno1v0 (unregistered): page pool release stalling device unregister
[ 8642.600939] hns3 0000:7d:01.0 eno1v0 (unregistered): page pool release stalling device unregister
[ 8643.756914] hns3 0000:7d:01.0 eno1v0 (unregistered): page pool release stalling device unregister
[ 8644.904922] hns3 0000:7d:01.0 eno1v0 (unregistered): page pool release stalling device unregister
[ 8646.060910] hns3 0000:7d:01.0 eno1v0 (unregistered): page pool release stalling device unregister
[ 8647.208909] hns3 0000:7d:01.0 eno1v0 (unregistered): page pool release stalling device unregister
[ 8648.360931] hns3 0000:7d:01.0 eno1v0 (unregistered): page pool release stalling device unregister


I installed drgn, but I don't know how to find the pages that are still in use;
would you guide me on how to use it?

>>> Hopefully this is fine in this particular case, as we will only stall
>>> unregistering of devices which want the page pool to manage the DMA
>>> mapping for them, i.e. HW backed netdevs. And obviously keeping
>>> the netdev around is preferable to a crash.
Jakub Kicinski Aug. 8, 2024, 1:52 p.m. UTC | #8
On Thu, 8 Aug 2024 14:12:34 +0300 Ilias Apalodimas wrote:
> In any case why do you want to hold on the IOMMU? The network
> interface -- at least in theory -- should be down and we wont be
> processing any more packets.

I should have sent you a link to Yonglong's report:

https://lore.kernel.org/all/8743264a-9700-4227-a556-5f931c720211@huawei.com/

We get_device() hoping that it will keep the IOMMU machinery active
(even if the device won't use the page, we need to unmap it when it's
freed), but it sounds like the IOMMU dies when the driver is unbound, even
if there are outstanding references to the device.

I occasionally hit this problem reloading drivers during development,
TBH, too. And we have been told we "use the API wrong" so let's fix
it on our end?..
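
To illustrate the above, a simplified sketch (not verbatim kernel source) of the
existing behaviour: the pool takes a reference on the struct device when it is
asked to manage DMA, and only performs the unmap when a page finally comes back
-- which is exactly the unmap that blows up once the driver unbind has torn the
IOMMU mappings down:

static void sketch_page_pool_pin_device(struct page_pool *pool)
{
	if (pool->slow.flags & PP_FLAG_DMA_MAP)
		get_device(pool->p.dev);	/* dropped again when the pool is freed */
}

static void sketch_page_pool_release_page(struct page_pool *pool, struct page *page)
{
	if (!(pool->slow.flags & PP_FLAG_DMA_MAP))
		return;

	/* This can run long after the driver is gone; if the IOMMU domain
	 * died with the driver unbind, this is where things crash. */
	dma_unmap_page_attrs(pool->p.dev, page_pool_get_dma_addr(page),
			     PAGE_SIZE << pool->p.order, pool->p.dma_dir,
			     DMA_ATTR_SKIP_CPU_SYNC);
}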
Jakub Kicinski Aug. 8, 2024, 2:05 p.m. UTC | #9
On Thu, 8 Aug 2024 20:52:52 +0800 Yonglong Liu wrote:
> I hooks the netdev to the page pool, and run with this patch for a 
> while, then get
> 
> the following messages, and the vf can not disable:

> [ 1950.137586] hns3 0000:7d:01.0 eno1v0: link up
> [ 1950.137671] hns3 0000:7d:01.0 eno1v0: net open
> [ 1950.147098] 8021q: adding VLAN 0 to HW filter on device eno1v0
> [ 1974.287476] hns3 0000:7d:01.0 eno1v0: net stop
> [ 1974.294359] hns3 0000:7d:01.0 eno1v0: link down
> [ 1975.596916] hns3 0000:7d:01.0 eno1v0 (unregistered): page pool 
> release stalling device unregister
> [ 1976.744947] hns3 0000:7d:01.0 eno1v0 (unregistered): page pool 
> release stalling device unregister

So.. the patch works? :) We may want to add this to get the info prints
back:

diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index 2abe6e919224..26bc1618de7c 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -1021,11 +1021,12 @@ static void page_pool_release_retry(struct work_struct *wq)
 	/* Periodic warning for page pools the user can't see */
 	netdev = READ_ONCE(pool->slow.netdev);
 	if (time_after_eq(jiffies, pool->defer_warn) &&
-	    (!netdev || netdev == NET_PTR_POISON)) {
+	    (!netdev || netdev == NET_PTR_POISON || netdev->pp_unreg_pending)) {
 		int sec = (s32)((u32)jiffies - (u32)pool->defer_start) / HZ;
 
-		pr_warn("%s() stalled pool shutdown: id %u, %d inflight %d sec\n",
-			__func__, pool->user.id, inflight, sec);
+		pr_warn("%s(): %s stalled pool shutdown: id %u, %d inflight %d sec (hold netdev: %d)\n",
+			__func__, netdev ? netdev_name(netdev) : "",
+			pool->user.id, inflight, sec, pool->defer_warn);
 		pool->defer_warn = jiffies + DEFER_WARN_INTERVAL;
 	}
 

> I install drgn, but don't know how to find out the using pages, would 
> you guide me on how to use it?

You can use this sample as a starting point:

https://github.com/osandov/drgn/blob/main/contrib/tcp_sock.py

but if the pages are actually leaked (rather than sitting in a socket),
you'll have to scan pages, not sockets. And figure out how they got leaked.
Somehow...
Ilias Apalodimas Aug. 8, 2024, 2:30 p.m. UTC | #10
Hi Jakub,

On Thu, 8 Aug 2024 at 16:52, Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Thu, 8 Aug 2024 14:12:34 +0300 Ilias Apalodimas wrote:
> > In any case why do you want to hold on the IOMMU? The network
> > interface -- at least in theory -- should be down and we wont be
> > processing any more packets.
>
> I should have you a link to Yonglong's report:
>
> https://lore.kernel.org/all/8743264a-9700-4227-a556-5f931c720211@huawei.com/

Ah I read that bug report during my vacation but completely forgot about it...
Thanks this helps.

>
> we get_device() hoping that it will keep the IOMMU machinery active
> (even if the device won't use the page we need to unmap it when it's
> freed), but it sounds like IOMMU dies when driver is unbound. Even if
> there are outstanding references to the device.
>
> I occasionally hit this problem reloading drivers during development,
> TBH, too. And we have been told we "use the API wrong" so let's fix
> it on our end?..

It's been a while since I looked at the use cases, but I don't love
the idea of stalling the netdev removal until sockets process all
packets. There's a chance that the device will stay there forever.
I'll have to take a closer look but the first thing that comes to mind
is to unmap the pages early, before page_pool_destroy() is called and
perhaps add a flag that says "the pool is there only to process
existing packets, but you can't DMA into it anymore".

Thanks
/Ilias
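
A very rough illustration of that idea -- every name below is hypothetical, this
is not an existing API: on teardown, unmap whatever the pool still holds in its
own caches and flag the pool so that late returns skip the then-invalid unmap.
The hard part, as the next reply notes, is the pages that are in flight and
cannot easily be found:

/* Hypothetical sketch of "unmap early + flag", not an existing API. */
static struct page *sketch_pool_pop_cached(struct page_pool *pool); /* made-up
					* helper draining the alloc cache and ptr_ring */

static void sketch_page_pool_disable_dma(struct page_pool *pool)
{
	struct page *page;

	/* pages still sitting in the pool's caches can be unmapped now */
	while ((page = sketch_pool_pop_cached(pool)))
		dma_unmap_page_attrs(pool->p.dev, page_pool_get_dma_addr(page),
				     PAGE_SIZE << pool->p.order, pool->p.dma_dir,
				     DMA_ATTR_SKIP_CPU_SYNC);

	/* hypothetical flag: "no new DMA, and skip unmapping on return" */
	pool->dma_disabled = true;
}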
Jakub Kicinski Aug. 8, 2024, 2:51 p.m. UTC | #11
On Thu, 8 Aug 2024 17:30:31 +0300 Ilias Apalodimas wrote:
> > we get_device() hoping that it will keep the IOMMU machinery active
> > (even if the device won't use the page we need to unmap it when it's
> > freed), but it sounds like IOMMU dies when driver is unbound. Even if
> > there are outstanding references to the device.
> >
> > I occasionally hit this problem reloading drivers during development,
> > TBH, too. And we have been told we "use the API wrong" so let's fix
> > it on our end?..  
> 
> It's been a while since I looked at the use cases, but I don't love
> the idea of stalling the netdev removal until sockets process all
> packets. There's a chance that the device will stay there forever.

True, my thinking is that there are 3 cases:
 - good case, nothing gets stalled
 - pages held, no IOMMU, we may make it worse, user doesn't care
 - pages held, IOMMU enabled, it would have crashed

given that we get so few reports about the third one, I tend towards
thinking that the risk of stall is somewhat limited.

> I'll have to take a closer look but the first thing that comes to mind
> is to unmap the pages early, before page_pool_destroy() is called and
> perhaps add a flag that says "the pool is there only to process
> existing packets, but you can't DMA into it anymore".

Yeah, but we need to find them... Maybe Alex knows how many rotten veg
will be thrown our way if we try to scan all struct pages, IDK if that's
considered acceptable.
Yonglong Liu Aug. 9, 2024, 6:06 a.m. UTC | #12
On 2024/8/8 22:05, Jakub Kicinski wrote:
> On Thu, 8 Aug 2024 20:52:52 +0800 Yonglong Liu wrote:
>> I hooks the netdev to the page pool, and run with this patch for a
>> while, then get
>>
>> the following messages, and the vf can not disable:
>> [ 1950.137586] hns3 0000:7d:01.0 eno1v0: link up
>> [ 1950.137671] hns3 0000:7d:01.0 eno1v0: net open
>> [ 1950.147098] 8021q: adding VLAN 0 to HW filter on device eno1v0
>> [ 1974.287476] hns3 0000:7d:01.0 eno1v0: net stop
>> [ 1974.294359] hns3 0000:7d:01.0 eno1v0: link down
>> [ 1975.596916] hns3 0000:7d:01.0 eno1v0 (unregistered): page pool
>> release stalling device unregister
>> [ 1976.744947] hns3 0000:7d:01.0 eno1v0 (unregistered): page pool
>> release stalling device unregister
> So.. the patch works? :) We may want to add this to get the info prints
> back:
>
> diff --git a/net/core/page_pool.c b/net/core/page_pool.c
> index 2abe6e919224..26bc1618de7c 100644
> --- a/net/core/page_pool.c
> +++ b/net/core/page_pool.c
> @@ -1021,11 +1021,12 @@ static void page_pool_release_retry(struct work_struct *wq)
>   	/* Periodic warning for page pools the user can't see */
>   	netdev = READ_ONCE(pool->slow.netdev);
>   	if (time_after_eq(jiffies, pool->defer_warn) &&
> -	    (!netdev || netdev == NET_PTR_POISON)) {
> +	    (!netdev || netdev == NET_PTR_POISON || netdev->pp_unreg_pending)) {
>   		int sec = (s32)((u32)jiffies - (u32)pool->defer_start) / HZ;
>   
> -		pr_warn("%s() stalled pool shutdown: id %u, %d inflight %d sec\n",
> -			__func__, pool->user.id, inflight, sec);
> +		pr_warn("%s(): %s stalled pool shutdown: id %u, %d inflight %d sec (hold netdev: %d)\n",
> +			__func__, netdev ? netdev_name(netdev) : "",
> +			pool->user.id, inflight, sec, pool->defer_warn);
>   		pool->defer_warn = jiffies + DEFER_WARN_INTERVAL;
>   	}
>   

Here is the log:

[ 1018.059215] hns3 0000:7d:01.0 eno1v0: net stop
[ 1018.066095] hns3 0000:7d:01.0 eno1v0: link down
[ 1019.340845] hns3 0000:7d:01.0 eno1v0 (unregistered): page pool release stalling device unregister
[ 1020.492848] hns3 0000:7d:01.0 eno1v0 (unregistered): page pool release stalling device unregister
[ 1021.648849] hns3 0000:7d:01.0 eno1v0 (unregistered): page pool release stalling device unregister
[ 1022.796850] hns3 0000:7d:01.0 eno1v0 (unregistered): page pool release stalling device unregister
[ 1023.980837] hns3 0000:7d:01.0 eno1v0 (unregistered): page pool release stalling device unregister
[ 1025.132850] hns3 0000:7d:01.0 eno1v0 (unregistered): page pool release stalling device unregister
[ 1026.284851] hns3 0000:7d:01.0 eno1v0 (unregistered): page pool release stalling device unregister
[ 1027.500853] hns3 0000:7d:01.0 eno1v0 (unregistered): page pool release stalling device unregister
[ 1028.364855] unregister_netdevice: waiting for eno1v0 to become free. Usage count = 2
[ 1028.652851] hns3 0000:7d:01.0 eno1v0 (unregistered): page pool release stalling device unregister
[ 1029.808845] hns3 0000:7d:01.0 eno1v0 (unregistered): page pool release stalling device unregister
[ 1030.956843] hns3 0000:7d:01.0 eno1v0 (unregistered): page pool release stalling device unregister

...

[ 1078.476854] hns3 0000:7d:01.0: page_pool_release_retry(): eno1v0 stalled pool shutdown: id 553, 82 inflight 60 sec (hold netdev: 194039)

...

[ 1138.892838] hns3 0000:7d:01.0: page_pool_release_retry(): eno1v0 stalled pool shutdown: id 553, 82 inflight 120 sec (hold netdev: 209147)

...

[ 1199.308841] hns3 0000:7d:01.0: page_pool_release_retry(): eno1v0 stalled pool shutdown: id 553, 82 inflight 181 sec (hold netdev: 224251)


...

[ 1259.724849] hns3 0000:7d:01.0: page_pool_release_retry(): eno1v0 stalled pool shutdown: id 553, 82 inflight 241 sec (hold netdev: 239355)

...

[ 7603.436840] hns3 0000:7d:01.0: page_pool_release_retry(): eno1v0 stalled pool shutdown: id 553, 82 inflight 6585 sec (hold netdev: 1825283)
[ 7663.852858] hns3 0000:7d:01.0: page_pool_release_retry(): eno1v0 stalled pool shutdown: id 553, 82 inflight 6645 sec (hold netdev: 1840387)
[ 7724.272853] hns3 0000:7d:01.0: page_pool_release_retry(): eno1v0 stalled pool shutdown: id 553, 82 inflight 6706 sec (hold netdev: 1855491)


There is also the following INFO splat (last time I forgot to include this):

[ 1211.213006] INFO: task systemd-journal:1598 blocked for more than 120 seconds.
[ 1211.220217]       Not tainted 6.10.0-rc4+ #1
[ 1211.224480] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1211.232284] task:systemd-journal state:D stack:0     pid:1598 tgid:1598  ppid:1      flags:0x00000801
[ 1211.241568] Call trace:
[ 1211.244005]  __switch_to+0xec/0x138
[ 1211.247509]  __schedule+0x2f4/0x10e0
[ 1211.251080]  schedule+0x3c/0x148
[ 1211.254308]  schedule_preempt_disabled+0x2c/0x50
[ 1211.258927]  __mutex_lock.constprop.0+0x2b0/0x618
[ 1211.263625]  __mutex_lock_slowpath+0x1c/0x30
[ 1211.267888]  mutex_lock+0x40/0x58
[ 1211.271203]  uevent_show+0x90/0x140
[ 1211.274687]  dev_attr_show+0x28/0x80
[ 1211.278265]  sysfs_kf_seq_show+0xb4/0x138
[ 1211.282277]  kernfs_seq_show+0x34/0x48
[ 1211.286029]  seq_read_iter+0x1bc/0x4b8
[ 1211.289780]  kernfs_fop_read_iter+0x148/0x1c8
[ 1211.294128]  vfs_read+0x25c/0x308
[ 1211.297443]  ksys_read+0x70/0x108
[ 1211.300745]  __arm64_sys_read+0x24/0x38
[ 1211.304581]  invoke_syscall+0x50/0x128
[ 1211.308331]  el0_svc_common.constprop.0+0xc8/0xf0
[ 1211.313023]  do_el0_svc+0x24/0x38
[ 1211.316326]  el0_svc+0x38/0x100
[ 1211.319470]  el0t_64_sync_handler+0xc0/0xc8
[ 1211.323647]  el0t_64_sync+0x1a4/0x1a8

>> I install drgn, but don't know how to find out the using pages, would
>> you guide me on how to use it?
> You can use this sample as a starting point:
>
> https://github.com/osandov/drgn/blob/main/contrib/tcp_sock.py
>
> but if the pages are actually leaked (rather than sitting in a socket),
> you'll have to scan pages, not sockets. And figure out how they got leaked.
> Somehow...

Thanks : )
Jakub Kicinski Aug. 10, 2024, 3:57 a.m. UTC | #13
On Fri, 9 Aug 2024 14:06:02 +0800 Yonglong Liu wrote:
> [ 7724.272853] hns3 0000:7d:01.0: page_pool_release_retry(): eno1v0 
> stalled pool shutdown: id 553, 82 inflight 6706 sec (hold netdev: 1855491)

Alright :( You gotta look around for those 82 pages somehow with drgn.
bpftrace+kfunc the work that does the periodic print to get the address
of the page pool struct and then look around for pages from that pp.. :(
Yonglong Liu Aug. 14, 2024, 10:09 a.m. UTC | #14
On 2024/8/10 11:57, Jakub Kicinski wrote:
> On Fri, 9 Aug 2024 14:06:02 +0800 Yonglong Liu wrote:
>> [ 7724.272853] hns3 0000:7d:01.0: page_pool_release_retry(): eno1v0
>> stalled pool shutdown: id 553, 82 inflight 6706 sec (hold netdev: 1855491)
> Alright :( You gotta look around for those 82 pages somehow with drgn.
> bpftrace+kfunc the work that does the periodic print to get the address
> of the page pool struct and then look around for pages from that pp.. :(

I spent some time learning how to use drgn and found those pages, but I think
those pages were allocated by the hns3 driver. How can I find out who owns those
pages now?

The following is the info for one of the pages:

*(struct page *)0xfffffe004359de80 = {
	.flags = (unsigned long)432345014471753728,
	.lru = (struct list_head){
		.next = (struct list_head *)0xdead000000000040,
		.prev = (struct list_head *)0xffff002105c18000,
	},
	.__filler = (void *)0xdead000000000040,
	.mlock_count = (unsigned int)96567296,
	.buddy_list = (struct list_head){
		.next = (struct list_head *)0xdead000000000040,
		.prev = (struct list_head *)0xffff002105c18000,
	},
	.pcp_list = (struct list_head){
		.next = (struct list_head *)0xdead000000000040,
		.prev = (struct list_head *)0xffff002105c18000,
	},
	.mapping = (struct address_space *)0x0,
	.index = (unsigned long)4223168512,
	.share = (unsigned long)4223168512,
	.private = (unsigned long)2,
	.pp_magic = (unsigned long)16045481047390945344,
	.pp = (struct page_pool *)0xffff002105c18000,
	._pp_mapping_pad = (unsigned long)0,
	.dma_addr = (unsigned long)4223168512,
	.pp_ref_count = (atomic_long_t){ .counter = (s64)2, },
	.compound_head = (unsigned long)16045481047390945344,
	.pgmap = (struct dev_pagemap *)0xdead000000000040,
	.zone_device_data = (void *)0xffff002105c18000,
	.callback_head = (struct callback_head){
		.next = (struct callback_head *)0xdead000000000040,
		.func = (void (*)(struct callback_head *))0xffff002105c18000,
	},
	._mapcount = (atomic_t){ .counter = (int)-1, },
	.page_type = (unsigned int)4294967295,
	._refcount = (atomic_t){ .counter = (int)1, },
	.memcg_data = (unsigned long)0,
}
Jakub Kicinski Aug. 14, 2024, 2:56 p.m. UTC | #15
On Wed, 14 Aug 2024 18:09:59 +0800 Yonglong Liu wrote:
> On 2024/8/10 11:57, Jakub Kicinski wrote:
> > On Fri, 9 Aug 2024 14:06:02 +0800 Yonglong Liu wrote:  
> >> [ 7724.272853] hns3 0000:7d:01.0: page_pool_release_retry(): eno1v0
> >> stalled pool shutdown: id 553, 82 inflight 6706 sec (hold netdev: 1855491)  
> > Alright :( You gotta look around for those 82 pages somehow with drgn.
> > bpftrace+kfunc the work that does the periodic print to get the address
> > of the page pool struct and then look around for pages from that pp.. :(  
> 
> I spent some time to learn how to use the drgn, and found those page, 
> but I think those page
> 
> is allocated by the hns3 driver, how to find out who own those page now?

Scan the entire system memory looking for a pointer to this page.
Dump the memory around the location which holds that pointer. If you're
lucky the page will be held by an skb, and the memory around it will
look like struct skb_shared_info. If you're less lucky the page is used
by an sk_buff for the head and the address will not be exact. If you're less
lucky still the page will have been directly leaked by the driver, and not
pointed to by anything...

I think the last case is most likely, FWIW.
Yunsheng Lin Sept. 5, 2024, 10:47 a.m. UTC | #16
On 2024/8/6 23:16, Jakub Kicinski wrote:
> There appears to be no clean way to hold onto the IOMMU, so page pool
> cannot outlast the driver which created it. We have no way to stall
> the driver unregister, but we can use netdev unregistration as a proxy.
> 
> Note that page pool pages may last forever, we have seen it happen
> e.g. when application leaks a socket and page is stuck in its rcv queue.

I am assuming the page will be released when the application dies or
exits, right?

Also, I am not sure whether the above application is a privileged one or not.
If it is not a privileged one, perhaps we need to fix the above problem
in the kernel, as it does not seem to make sense for an unprivileged
application to make the kernel leak pages and stall the unregistering
of devices.

> Hopefully this is fine in this particular case, as we will only stall
> unregistering of devices which want the page pool to manage the DMA
> mapping for them, i.e. HW backed netdevs. And obviously keeping
> the netdev around is preferable to a crash.

From internal testing and debugging, it seems there are at least
two cases where the page is not released fast enough at the moment:
1. ipv4 packet defragmentation timeout: this seems to cause a delay of up
   to 30 secs:
#define IP_FRAG_TIME	(30 * HZ)		/* fragment lifetime	*/

2. skb_defer_free_flush(): this may cause an indefinite delay if nothing
   triggers net_rx_action() (see the sketch after the trace below). Below
   is the dump_stack() when the page is returned back to the page_pool
   after reloading the driver, which triggers net_rx_action():

[  515.286580] Call trace:
[  515.289012]  dump_backtrace+0x9c/0x100
[  515.292748]  show_stack+0x20/0x38
[  515.296049]  dump_stack_lvl+0x78/0x90
[  515.299699]  dump_stack+0x18/0x28
[  515.303001]  page_pool_put_unrefed_netmem+0x2c4/0x3d0
[  515.308039]  napi_pp_put_page+0xb4/0xe0
[  515.311863]  skb_release_data+0xf8/0x1e0
[  515.315772]  kfree_skb_list_reason+0xb4/0x2a0
[  515.320115]  skb_release_data+0x148/0x1e0
[  515.324111]  napi_consume_skb+0x64/0x190
[  515.328021]  net_rx_action+0x110/0x2a8
[  515.331758]  handle_softirqs+0x120/0x368
[  515.335668]  __do_softirq+0x1c/0x28
[  515.339143]  ____do_softirq+0x18/0x30
[  515.342792]  call_on_irq_stack+0x24/0x58
[  515.346701]  do_softirq_own_stack+0x24/0x38
[  515.350871]  irq_exit_rcu+0x94/0xd0
[  515.354347]  el1_interrupt+0x38/0x68
[  515.357910]  el1h_64_irq_handler+0x18/0x28
[  515.361994]  el1h_64_irq+0x64/0x68
[  515.365382]  default_idle_call+0x34/0x140
[  515.369378]  do_idle+0x20c/0x270
[  515.372593]  cpu_startup_entry+0x40/0x50
[  515.376503]  secondary_start_kernel+0x138/0x160
[  515.381021]  __secondary_switched+0xb8/0xc0
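
For readers unfamiliar with case 2, a simplified sketch (not verbatim kernel
source) of the deferred-free path: skbs freed from a remote CPU are parked on
that CPU's per-CPU defer list and only flushed from the RX softirq path, so if
net_rx_action() never runs there, the skbs -- and any page_pool pages they hold
-- stay referenced indefinitely:

static void sketch_skb_defer_free_flush(struct softnet_data *sd)
{
	struct sk_buff *skb, *next;

	spin_lock(&sd->defer_lock);
	skb = sd->defer_list;
	sd->defer_list = NULL;
	sd->defer_count = 0;
	spin_unlock(&sd->defer_lock);

	/* only reached from net_rx_action() and friends; if that never
	 * runs on this CPU, nothing below ever executes */
	while (skb) {
		next = skb->next;
		napi_consume_skb(skb, 1);	/* eventually napi_pp_put_page() */
		skb = next;
	}
}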

> 
> More work is needed for weird drivers which share one pool among
> multiple netdevs, as they are not allowed to set the pp->netdev
> pointer. We probably need to add a bit that says "don't expose
> to uAPI for them".

Which driver are we talking about here that shares one pool among multiple
netdevs? Is the sharing done to save memory?

>

Patch

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 0ef3eaa23f4b..c817bde7bacc 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -2342,6 +2342,8 @@  struct net_device {
 	struct lock_class_key	*qdisc_tx_busylock;
 	bool			proto_down;
 	bool			threaded;
+	/** @pp_unreg_pending: page pool code is stalling unregister */
+	bool			pp_unreg_pending;
 
 	struct list_head	net_notifier_list;
 
@@ -2371,6 +2373,8 @@  struct net_device {
 #if IS_ENABLED(CONFIG_PAGE_POOL)
 	/** @page_pools: page pools created for this netdevice */
 	struct hlist_head	page_pools;
+	/** @pp_dev_tracker: ref tracker for page pool code stalling unreg */
+	netdevice_tracker	pp_dev_tracker;
 #endif
 
 	/** @irq_moder: dim parameters used if IS_ENABLED(CONFIG_DIMLIB). */
diff --git a/net/core/page_pool_user.c b/net/core/page_pool_user.c
index 3a3277ba167b..1a4135f01130 100644
--- a/net/core/page_pool_user.c
+++ b/net/core/page_pool_user.c
@@ -349,22 +349,36 @@  static void page_pool_unreg_netdev_wipe(struct net_device *netdev)
 	struct page_pool *pool;
 	struct hlist_node *n;
 
-	mutex_lock(&page_pools_lock);
 	hlist_for_each_entry_safe(pool, n, &netdev->page_pools, user.list) {
 		hlist_del_init(&pool->user.list);
 		pool->slow.netdev = NET_PTR_POISON;
 	}
-	mutex_unlock(&page_pools_lock);
 }
 
-static void page_pool_unreg_netdev(struct net_device *netdev)
+static void page_pool_unreg_netdev_stall(struct net_device *netdev)
+{
+	if (!netdev->pp_unreg_pending) {
+		netdev_hold(netdev, &netdev->pp_dev_tracker, GFP_KERNEL);
+		netdev->pp_unreg_pending = true;
+	} else {
+		netdev_warn(netdev,
+			    "page pool release stalling device unregister");
+	}
+}
+
+static void page_pool_unreg_netdev_unstall(struct net_device *netdev)
+{
+	netdev_put(netdev, &netdev->pp_dev_tracker);
+	netdev->pp_unreg_pending = false;
+}
+
+static void page_pool_unreg_netdev_reparent(struct net_device *netdev)
 {
 	struct page_pool *pool, *last;
 	struct net_device *lo;
 
 	lo = dev_net(netdev)->loopback_dev;
 
-	mutex_lock(&page_pools_lock);
 	last = NULL;
 	hlist_for_each_entry(pool, &netdev->page_pools, user.list) {
 		pool->slow.netdev = lo;
@@ -375,7 +389,6 @@  static void page_pool_unreg_netdev(struct net_device *netdev)
 	if (last)
 		hlist_splice_init(&netdev->page_pools, &last->user.list,
 				  &lo->page_pools);
-	mutex_unlock(&page_pools_lock);
 }
 
 static int
@@ -383,17 +396,30 @@  page_pool_netdevice_event(struct notifier_block *nb,
 			  unsigned long event, void *ptr)
 {
 	struct net_device *netdev = netdev_notifier_info_to_dev(ptr);
+	struct page_pool *pool;
+	bool has_dma;
 
 	if (event != NETDEV_UNREGISTER)
 		return NOTIFY_DONE;
 
-	if (hlist_empty(&netdev->page_pools))
+	if (hlist_empty(&netdev->page_pools) && !netdev->pp_unreg_pending)
 		return NOTIFY_OK;
 
-	if (netdev->ifindex != LOOPBACK_IFINDEX)
-		page_pool_unreg_netdev(netdev);
-	else
+	mutex_lock(&page_pools_lock);
+	has_dma = false;
+	hlist_for_each_entry(pool, &netdev->page_pools, user.list)
+		has_dma |= pool->slow.flags & PP_FLAG_DMA_MAP;
+
+	if (has_dma)
+		page_pool_unreg_netdev_stall(netdev);
+	else if (netdev->pp_unreg_pending)
+		page_pool_unreg_netdev_unstall(netdev);
+	else if (netdev->ifindex == LOOPBACK_IFINDEX)
 		page_pool_unreg_netdev_wipe(netdev);
+	else /* driver doesn't let page pools manage DMA addrs */
+		page_pool_unreg_netdev_reparent(netdev);
+	mutex_unlock(&page_pools_lock);
+
 	return NOTIFY_OK;
 }