Message ID | 20240806151618.1373008-1-kuba@kernel.org
---|---
State | RFC
Delegated to: | Netdev Maintainers
Series | [RFC,net] net: make page pool stall netdev unregistration to avoid IOMMU crashes
I tested this patch, have the same crash as I reported...

On 2024/8/6 23:16, Jakub Kicinski wrote:
> There appears to be no clean way to hold onto the IOMMU, so page pool
> cannot outlast the driver which created it. We have no way to stall
> the driver unregister, but we can use netdev unregistration as a proxy.
>
> Note that page pool pages may last forever, we have seen it happen
> e.g. when application leaks a socket and page is stuck in its rcv queue.
> Hopefully this is fine in this particular case, as we will only stall
> unregistering of devices which want the page pool to manage the DMA
> mapping for them, i.e. HW backed netdevs. And obviously keeping
> the netdev around is preferable to a crash.
>
> More work is needed for weird drivers which share one pool among
> multiple netdevs, as they are not allowed to set the pp->netdev
> pointer. We probably need to add a bit that says "don't expose
> to uAPI for them".
>
> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
> [...]
On 2024/8/6 23:16, Jakub Kicinski wrote:
> There appears to be no clean way to hold onto the IOMMU, so page pool
> cannot outlast the driver which created it. We have no way to stall
> the driver unregister, but we can use netdev unregistration as a proxy.
>
> Note that page pool pages may last forever, we have seen it happen
> e.g. when application leaks a socket and page is stuck in its rcv queue.

We saw some page_pool pages might last forever too, but were not sure
if it was the same reason as above. Are there some commands/ways to
debug whether an application leaks a socket and a page is stuck in its
rcv queue?

> Hopefully this is fine in this particular case, as we will only stall
> unregistering of devices which want the page pool to manage the DMA
> mapping for them, i.e. HW backed netdevs. And obviously keeping
> the netdev around is preferable to a crash.
>
> More work is needed for weird drivers which share one pool among
> multiple netdevs, as they are not allowed to set the pp->netdev
> pointer. We probably need to add a bit that says "don't expose
> to uAPI for them".
On Wed, 7 Aug 2024 15:03:06 +0800 Yonglong Liu wrote:
> I tested this patch, have the same crash as I reported...
Which driver? This is only gonna work if the driver hooks the netdev
to the page pool. Which is still rare for drivers lacking community
involvement.
On Wed, 7 Aug 2024 19:00:35 +0800 Yunsheng Lin wrote:
> > Note that page pool pages may last forever, we have seen it happen
> > e.g. when application leaks a socket and page is stuck in its rcv queue.
>
> We saw some page_pool pages might last forever too, but were not sure
> if it was the same reason as above. Are there some commands/ways to
> debug whether an application leaks a socket and a page is stuck in its
> rcv queue?

I used drgn to scan all sockets to find the page.

> > Hopefully this is fine in this particular case, as we will only stall
> > unregistering of devices which want the page pool to manage the DMA
> > mapping for them, i.e. HW backed netdevs. And obviously keeping
> > the netdev around is preferable to a crash.
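For reference, a minimal drgn sketch of the socket scan described above
might look like the following. It is not the exact script used in the
thread: it assumes a ~v6.10 kernel, drgn's Linux helpers, and the arm64
PP_SIGNATURE value seen later in this thread, and it mirrors the ehash
walk from drgn's contrib/tcp_sock.py. Run it with the drgn CLI, which
provides the global `prog`.

#!/usr/bin/env drgn
# Sketch: walk every established TCP socket's receive queue and flag
# skb frags whose pages carry the page pool signature.
from drgn import Object, cast
from drgn.helpers.linux.net import sk_nulls_for_each

TCP_ESTABLISHED = 1
# 0x40 + POISON_POINTER_DELTA on this arm64 config; arch/config specific
PP_SIGNATURE = 0xdead000000000040

def frag_page(frag):
    # skb_frag_t is a bio_vec ("bv_page") on older trees; newer trees
    # store a netmem_ref whose low bit is a type flag -- handle both
    try:
        return frag.bv_page
    except (AttributeError, LookupError):
        return Object(prog, "struct page *", frag.netmem.value_() & ~1)

def scan_rcv_queue(sk):
    head = sk.sk_receive_queue.address_of_().value_()
    skb = sk.sk_receive_queue.next
    while skb.value_() != head:
        # assumes NET_SKBUFF_DATA_USES_OFFSET (64-bit kernels)
        shinfo = cast("struct skb_shared_info *", skb.head + skb.end)
        for i in range(shinfo.nr_frags):
            page = frag_page(shinfo.frags[i])
            # low bits of pp_magic are used as flags, mask them off
            if page.pp_magic.value_() & ~0x3 == PP_SIGNATURE:
                print(f"sk {sk.value_():x} skb {skb.value_():x} "
                      f"page {page.value_():x} pp {page.pp.value_():x}")
        skb = skb.next

tcp_hashinfo = prog["tcp_hashinfo"]
for i in range(tcp_hashinfo.ehash_mask.value_() + 1):
    for sk in sk_nulls_for_each(tcp_hashinfo.ehash[i].chain):
        if sk.__sk_common.skc_state.value_() == TCP_ESTABLISHED:
            scan_rcv_queue(sk)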
On 2024/8/7 22:27, Jakub Kicinski wrote:
> On Wed, 7 Aug 2024 15:03:06 +0800 Yonglong Liu wrote:
>> I tested this patch, have the same crash as I reported...
> Which driver? This is only gonna work if the driver hooks the netdev
> to the page pool. Which is still rare for drivers lacking community
> involvement.

hns3 driver, we didn't hook the netdev to the page pool, will try this
later, thanks : )
Hi Jakub,

On Tue, 6 Aug 2024 at 18:16, Jakub Kicinski <kuba@kernel.org> wrote:
>
> There appears to be no clean way to hold onto the IOMMU, so page pool
> cannot outlast the driver which created it. We have no way to stall
> the driver unregister, but we can use netdev unregistration as a proxy.

Isn't the inflight machinery enough? Looking at the page_pool_destroy()
path, we eventually call page_pool_return_page() which will try to
unmap memory. That won't be called if we have inflight packets.

In any case, why do you want to hold on to the IOMMU? The network
interface -- at least in theory -- should be down and we won't be
processing any more packets.

Regards
/Ilias

> Note that page pool pages may last forever, we have seen it happen
> e.g. when application leaks a socket and page is stuck in its rcv queue.
> Hopefully this is fine in this particular case, as we will only stall
> unregistering of devices which want the page pool to manage the DMA
> mapping for them, i.e. HW backed netdevs. And obviously keeping
> the netdev around is preferable to a crash.
>
> More work is needed for weird drivers which share one pool among
> multiple netdevs, as they are not allowed to set the pp->netdev
> pointer. We probably need to add a bit that says "don't expose
> to uAPI for them".
>
> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
> [...]
On 2024/8/7 22:29, Jakub Kicinski wrote:
> On Wed, 7 Aug 2024 19:00:35 +0800 Yunsheng Lin wrote:
>>> Note that page pool pages may last forever, we have seen it happen
>>> e.g. when application leaks a socket and page is stuck in its rcv queue.
>> We saw some page_pool pages might last forever too, but were not sure
>> if it was the same reason as above. Are there some commands/ways to
>> debug whether an application leaks a socket and a page is stuck in its
>> rcv queue?
> I used drgn to scan all sockets to find the page.

I hooked the netdev to the page pool, and ran with this patch for a
while, then got the following messages, and the VF can not be disabled:

[ 1950.137586] hns3 0000:7d:01.0 eno1v0: link up
[ 1950.137671] hns3 0000:7d:01.0 eno1v0: net open
[ 1950.147098] 8021q: adding VLAN 0 to HW filter on device eno1v0
[ 1974.287476] hns3 0000:7d:01.0 eno1v0: net stop
[ 1974.294359] hns3 0000:7d:01.0 eno1v0: link down
[ 1975.596916] hns3 0000:7d:01.0 eno1v0 (unregistered): page pool release stalling device unregister
[ 1976.744947] hns3 0000:7d:01.0 eno1v0 (unregistered): page pool release stalling device unregister
[ 1977.900916] hns3 0000:7d:01.0 eno1v0 (unregistered): page pool release stalling device unregister
[ 1979.080929] hns3 0000:7d:01.0 eno1v0 (unregistered): page pool release stalling device unregister
[ 1980.236914] hns3 0000:7d:01.0 eno1v0 (unregistered): page pool release stalling device unregister
[ 1981.384913] hns3 0000:7d:01.0 eno1v0 (unregistered): page pool release stalling device unregister
[ 1982.568918] hns3 0000:7d:01.0 eno1v0 (unregistered): page pool release stalling device unregister
[ 1983.720912] hns3 0000:7d:01.0 eno1v0 (unregistered): page pool release stalling device unregister
[ 1984.584941] unregister_netdevice: waiting for eno1v0 to become free. Usage count = 2
[ 1984.872930] hns3 0000:7d:01.0 eno1v0 (unregistered): page pool release stalling device unregister
...
[ 8640.008931] unregister_netdevice: waiting for eno1v0 to become free. Usage count = 2
[ 8640.300912] hns3 0000:7d:01.0 eno1v0 (unregistered): page pool release stalling device unregister
...
[ 8648.360931] hns3 0000:7d:01.0 eno1v0 (unregistered): page pool release stalling device unregister

I installed drgn, but don't know how to find out the in-use pages;
would you guide me on how to use it?

>>> Hopefully this is fine in this particular case, as we will only stall
>>> unregistering of devices which want the page pool to manage the DMA
>>> mapping for them, i.e. HW backed netdevs. And obviously keeping
>>> the netdev around is preferable to a crash.
On Thu, 8 Aug 2024 14:12:34 +0300 Ilias Apalodimas wrote:
> In any case, why do you want to hold on to the IOMMU? The network
> interface -- at least in theory -- should be down and we won't be
> processing any more packets.

I should have given you a link to Yonglong's report:

https://lore.kernel.org/all/8743264a-9700-4227-a556-5f931c720211@huawei.com/

we get_device() hoping that it will keep the IOMMU machinery active
(even if the device won't use the page we need to unmap it when it's
freed), but it sounds like the IOMMU dies when the driver is unbound,
even if there are outstanding references to the device.

I occasionally hit this problem reloading drivers during development,
TBH, too. And we have been told we "use the API wrong", so let's fix
it on our end?..
On Thu, 8 Aug 2024 20:52:52 +0800 Yonglong Liu wrote:
> I hooked the netdev to the page pool, and ran with this patch for a
> while, then got the following messages, and the VF can not be disabled:
> [ 1950.137586] hns3 0000:7d:01.0 eno1v0: link up
> [ 1950.137671] hns3 0000:7d:01.0 eno1v0: net open
> [ 1950.147098] 8021q: adding VLAN 0 to HW filter on device eno1v0
> [ 1974.287476] hns3 0000:7d:01.0 eno1v0: net stop
> [ 1974.294359] hns3 0000:7d:01.0 eno1v0: link down
> [ 1975.596916] hns3 0000:7d:01.0 eno1v0 (unregistered): page pool
> release stalling device unregister
> [ 1976.744947] hns3 0000:7d:01.0 eno1v0 (unregistered): page pool
> release stalling device unregister

So.. the patch works? :) We may want to add this to get the info prints
back:

diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index 2abe6e919224..26bc1618de7c 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -1021,11 +1021,12 @@ static void page_pool_release_retry(struct work_struct *wq)
 	/* Periodic warning for page pools the user can't see */
 	netdev = READ_ONCE(pool->slow.netdev);
 	if (time_after_eq(jiffies, pool->defer_warn) &&
-	    (!netdev || netdev == NET_PTR_POISON)) {
+	    (!netdev || netdev == NET_PTR_POISON || netdev->pp_unreg_pending)) {
 		int sec = (s32)((u32)jiffies - (u32)pool->defer_start) / HZ;
 
-		pr_warn("%s() stalled pool shutdown: id %u, %d inflight %d sec\n",
-			__func__, pool->user.id, inflight, sec);
+		pr_warn("%s(): %s stalled pool shutdown: id %u, %d inflight %d sec (hold netdev: %d)\n",
+			__func__, netdev ? netdev_name(netdev) : "",
+			pool->user.id, inflight, sec, pool->defer_warn);
 		pool->defer_warn = jiffies + DEFER_WARN_INTERVAL;
 	}

> I installed drgn, but don't know how to find out the in-use pages;
> would you guide me on how to use it?

You can use this sample as a starting point:

https://github.com/osandov/drgn/blob/main/contrib/tcp_sock.py

but if the pages are actually leaked (rather than sitting in a socket),
you'll have to scan pages, not sockets. And figure out how they got
leaked. Somehow...
Hi Jakub,

On Thu, 8 Aug 2024 at 16:52, Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Thu, 8 Aug 2024 14:12:34 +0300 Ilias Apalodimas wrote:
> > In any case, why do you want to hold on to the IOMMU? The network
> > interface -- at least in theory -- should be down and we won't be
> > processing any more packets.
>
> I should have given you a link to Yonglong's report:
>
> https://lore.kernel.org/all/8743264a-9700-4227-a556-5f931c720211@huawei.com/

Ah, I read that bug report during my vacation but completely forgot
about it... Thanks, this helps.

> we get_device() hoping that it will keep the IOMMU machinery active
> (even if the device won't use the page we need to unmap it when it's
> freed), but it sounds like the IOMMU dies when the driver is unbound,
> even if there are outstanding references to the device.
>
> I occasionally hit this problem reloading drivers during development,
> TBH, too. And we have been told we "use the API wrong", so let's fix
> it on our end?..

It's been a while since I looked at the use cases, but I don't love
the idea of stalling the netdev removal until sockets process all
packets. There's a chance that the device will stay there forever.

I'll have to take a closer look, but the first thing that comes to mind
is to unmap the pages early, before page_pool_destroy() is called, and
perhaps add a flag that says "the pool is there only to process
existing packets, but you can't DMA into it anymore".

Thanks
/Ilias
On Thu, 8 Aug 2024 17:30:31 +0300 Ilias Apalodimas wrote:
> > we get_device() hoping that it will keep the IOMMU machinery active
> > (even if the device won't use the page we need to unmap it when it's
> > freed), but it sounds like the IOMMU dies when the driver is unbound,
> > even if there are outstanding references to the device.
> >
> > I occasionally hit this problem reloading drivers during development,
> > TBH, too. And we have been told we "use the API wrong", so let's fix
> > it on our end?..
>
> It's been a while since I looked at the use cases, but I don't love
> the idea of stalling the netdev removal until sockets process all
> packets. There's a chance that the device will stay there forever.

True, my thinking is that there are 3 cases:
 - good case, nothing gets stalled
 - pages held, no IOMMU, we may make it worse, user doesn't care
 - pages held, IOMMU enabled, it would have crashed

given that we get so few reports about the third one, I tend towards
thinking that the risk of stall is somewhat limited.

> I'll have to take a closer look, but the first thing that comes to mind
> is to unmap the pages early, before page_pool_destroy() is called, and
> perhaps add a flag that says "the pool is there only to process
> existing packets, but you can't DMA into it anymore".

Yeah, but we need to find them... Maybe Alex knows how many rotten veg
will be thrown our way if we try to scan all struct pages, IDK if
that's considered acceptable.
On 2024/8/8 22:05, Jakub Kicinski wrote:
> On Thu, 8 Aug 2024 20:52:52 +0800 Yonglong Liu wrote:
>> I hooked the netdev to the page pool, and ran with this patch for a
>> while, then got the following messages, and the VF can not be disabled:
>> [ 1950.137586] hns3 0000:7d:01.0 eno1v0: link up
>> ...
> So.. the patch works? :) We may want to add this to get the info prints
> back:
> [...]

Here is the log:

[ 1018.059215] hns3 0000:7d:01.0 eno1v0: net stop
[ 1018.066095] hns3 0000:7d:01.0 eno1v0: link down
[ 1019.340845] hns3 0000:7d:01.0 eno1v0 (unregistered): page pool release stalling device unregister
[ 1020.492848] hns3 0000:7d:01.0 eno1v0 (unregistered): page pool release stalling device unregister
[ 1021.648849] hns3 0000:7d:01.0 eno1v0 (unregistered): page pool release stalling device unregister
[ 1022.796850] hns3 0000:7d:01.0 eno1v0 (unregistered): page pool release stalling device unregister
[ 1023.980837] hns3 0000:7d:01.0 eno1v0 (unregistered): page pool release stalling device unregister
[ 1025.132850] hns3 0000:7d:01.0 eno1v0 (unregistered): page pool release stalling device unregister
[ 1026.284851] hns3 0000:7d:01.0 eno1v0 (unregistered): page pool release stalling device unregister
[ 1027.500853] hns3 0000:7d:01.0 eno1v0 (unregistered): page pool release stalling device unregister
[ 1028.364855] unregister_netdevice: waiting for eno1v0 to become free. Usage count = 2
[ 1028.652851] hns3 0000:7d:01.0 eno1v0 (unregistered): page pool release stalling device unregister
[ 1029.808845] hns3 0000:7d:01.0 eno1v0 (unregistered): page pool release stalling device unregister
[ 1030.956843] hns3 0000:7d:01.0 eno1v0 (unregistered): page pool release stalling device unregister
...
[ 1078.476854] hns3 0000:7d:01.0: page_pool_release_retry(): eno1v0 stalled pool shutdown: id 553, 82 inflight 60 sec (hold netdev: 194039)
...
[ 1138.892838] hns3 0000:7d:01.0: page_pool_release_retry(): eno1v0 stalled pool shutdown: id 553, 82 inflight 120 sec (hold netdev: 209147)
...
[ 1199.308841] hns3 0000:7d:01.0: page_pool_release_retry(): eno1v0 stalled pool shutdown: id 553, 82 inflight 181 sec (hold netdev: 224251)
...
[ 1259.724849] hns3 0000:7d:01.0: page_pool_release_retry(): eno1v0 stalled pool shutdown: id 553, 82 inflight 241 sec (hold netdev: 239355)
...
[ 7603.436840] hns3 0000:7d:01.0: page_pool_release_retry(): eno1v0 stalled pool shutdown: id 553, 82 inflight 6585 sec (hold netdev: 1825283)
[ 7663.852858] hns3 0000:7d:01.0: page_pool_release_retry(): eno1v0 stalled pool shutdown: id 553, 82 inflight 6645 sec (hold netdev: 1840387)
[ 7724.272853] hns3 0000:7d:01.0: page_pool_release_retry(): eno1v0 stalled pool shutdown: id 553, 82 inflight 6706 sec (hold netdev: 1855491)

And we also get the following INFO (last time I forgot this):

[ 1211.213006] INFO: task systemd-journal:1598 blocked for more than 120 seconds.
[ 1211.220217]       Not tainted 6.10.0-rc4+ #1
[ 1211.224480] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1211.232284] task:systemd-journal state:D stack:0 pid:1598 tgid:1598 ppid:1 flags:0x00000801
[ 1211.241568] Call trace:
[ 1211.244005]  __switch_to+0xec/0x138
[ 1211.247509]  __schedule+0x2f4/0x10e0
[ 1211.251080]  schedule+0x3c/0x148
[ 1211.254308]  schedule_preempt_disabled+0x2c/0x50
[ 1211.258927]  __mutex_lock.constprop.0+0x2b0/0x618
[ 1211.263625]  __mutex_lock_slowpath+0x1c/0x30
[ 1211.267888]  mutex_lock+0x40/0x58
[ 1211.271203]  uevent_show+0x90/0x140
[ 1211.274687]  dev_attr_show+0x28/0x80
[ 1211.278265]  sysfs_kf_seq_show+0xb4/0x138
[ 1211.282277]  kernfs_seq_show+0x34/0x48
[ 1211.286029]  seq_read_iter+0x1bc/0x4b8
[ 1211.289780]  kernfs_fop_read_iter+0x148/0x1c8
[ 1211.294128]  vfs_read+0x25c/0x308
[ 1211.297443]  ksys_read+0x70/0x108
[ 1211.300745]  __arm64_sys_read+0x24/0x38
[ 1211.304581]  invoke_syscall+0x50/0x128
[ 1211.308331]  el0_svc_common.constprop.0+0xc8/0xf0
[ 1211.313023]  do_el0_svc+0x24/0x38
[ 1211.316326]  el0_svc+0x38/0x100
[ 1211.319470]  el0t_64_sync_handler+0xc0/0xc8
[ 1211.323647]  el0t_64_sync+0x1a4/0x1a8

>> I installed drgn, but don't know how to find out the in-use pages;
>> would you guide me on how to use it?
> You can use this sample as a starting point:
>
> https://github.com/osandov/drgn/blob/main/contrib/tcp_sock.py
>
> but if the pages are actually leaked (rather than sitting in a socket),
> you'll have to scan pages, not sockets. And figure out how they got
> leaked. Somehow...

Thanks : )
On Fri, 9 Aug 2024 14:06:02 +0800 Yonglong Liu wrote:
> [ 7724.272853] hns3 0000:7d:01.0: page_pool_release_retry(): eno1v0
> stalled pool shutdown: id 553, 82 inflight 6706 sec (hold netdev: 1855491)

Alright :( You gotta look around for those 82 pages somehow with drgn.
bpftrace+kfunc the work that does the periodic print to get the address
of the page pool struct and then look around for pages from that pp.. :(
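A sketch of the second half of that suggestion with drgn: once the
page_pool address is known (from bpftrace, or just from the drgn shell),
walk every struct page and list the ones still owned by that pool. The
POOL and PP_SIGNATURE values below are only examples, taken from the
page dump in the next message, not universal constants:

#!/usr/bin/env drgn
# Sketch: find all inflight pages belonging to one page_pool. This
# iterates the whole memory map, so it can take a while on big boxes.
import drgn
from drgn.helpers.linux.mm import for_each_page

POOL = 0xffff002105c18000          # example: pool address from the dump below
PP_SIGNATURE = 0xdead000000000040  # 0x40 + POISON_POINTER_DELTA (arm64 config)

found = 0
for page in for_each_page(prog):
    try:
        # low bits of pp_magic are used as flags, mask them off
        if page.pp_magic.value_() & ~0x3 != PP_SIGNATURE:
            continue
        if page.pp.value_() != POOL:
            continue
    except drgn.FaultError:
        continue  # hole in the memory map
    found += 1
    print(f"page {page.value_():x} dma {page.dma_addr.value_():x} "
          f"pp_ref_count {page.pp_ref_count.counter.value_()}")
print(found, "inflight pages")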
On 2024/8/10 11:57, Jakub Kicinski wrote:
> On Fri, 9 Aug 2024 14:06:02 +0800 Yonglong Liu wrote:
>> [ 7724.272853] hns3 0000:7d:01.0: page_pool_release_retry(): eno1v0
>> stalled pool shutdown: id 553, 82 inflight 6706 sec (hold netdev: 1855491)
> Alright :( You gotta look around for those 82 pages somehow with drgn.
> bpftrace+kfunc the work that does the periodic print to get the address
> of the page pool struct and then look around for pages from that pp.. :(

I spent some time learning how to use drgn, and found those pages, but
I think those pages were allocated by the hns3 driver; how can I find
out who owns them now? The following is one page's info:

*(struct page *)0xfffffe004359de80 = {
	.flags = (unsigned long)432345014471753728,
	.lru = (struct list_head){
		.next = (struct list_head *)0xdead000000000040,
		.prev = (struct list_head *)0xffff002105c18000,
	},
	.__filler = (void *)0xdead000000000040,
	.mlock_count = (unsigned int)96567296,
	.buddy_list = (struct list_head){
		.next = (struct list_head *)0xdead000000000040,
		.prev = (struct list_head *)0xffff002105c18000,
	},
	.pcp_list = (struct list_head){
		.next = (struct list_head *)0xdead000000000040,
		.prev = (struct list_head *)0xffff002105c18000,
	},
	.mapping = (struct address_space *)0x0,
	.index = (unsigned long)4223168512,
	.share = (unsigned long)4223168512,
	.private = (unsigned long)2,
	.pp_magic = (unsigned long)16045481047390945344,
	.pp = (struct page_pool *)0xffff002105c18000,
	._pp_mapping_pad = (unsigned long)0,
	.dma_addr = (unsigned long)4223168512,
	.pp_ref_count = (atomic_long_t){
		.counter = (s64)2,
	},
	.compound_head = (unsigned long)16045481047390945344,
	.pgmap = (struct dev_pagemap *)0xdead000000000040,
	.zone_device_data = (void *)0xffff002105c18000,
	.callback_head = (struct callback_head){
		.next = (struct callback_head *)0xdead000000000040,
		.func = (void (*)(struct callback_head *))0xffff002105c18000,
	},
	._mapcount = (atomic_t){
		.counter = (int)-1,
	},
	.page_type = (unsigned int)4294967295,
	._refcount = (atomic_t){
		.counter = (int)1,
	},
	.memcg_data = (unsigned long)0,
}
On Wed, 14 Aug 2024 18:09:59 +0800 Yonglong Liu wrote:
> On 2024/8/10 11:57, Jakub Kicinski wrote:
> > On Fri, 9 Aug 2024 14:06:02 +0800 Yonglong Liu wrote:
> >> [ 7724.272853] hns3 0000:7d:01.0: page_pool_release_retry(): eno1v0
> >> stalled pool shutdown: id 553, 82 inflight 6706 sec (hold netdev: 1855491)
> > Alright :( You gotta look around for those 82 pages somehow with drgn.
> > bpftrace+kfunc the work that does the periodic print to get the address
> > of the page pool struct and then look around for pages from that pp.. :(
>
> I spent some time learning how to use drgn, and found those pages, but
> I think those pages were allocated by the hns3 driver; how can I find
> out who owns them now?

Scan the entire system memory looking for the pointer to this page.
Dump the memory around the location which holds that pointer.

If you're lucky the page will be held by an skb, and the memory around
it will look like struct skb_shared_info. If you're less lucky the page
is used by an sk_buff for the head and the address will not be exact.
If you're less lucky still, the page will be directly leaked by the
driver, and not pointed to by anything...

I think the last case is most likely, FWIW.
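For the lucky cases, a rough drgn template: rather than scanning raw
system memory as described above, one alternative is to walk the
allocated sk_buffs in the skbuff slab cache and check whether any head
or frag references the leaked page. The cache name and frag layout are
version specific, and TARGET is simply the struct page address from the
dump above, so treat this as a starting point, not the exact procedure:

#!/usr/bin/env drgn
# Sketch: look for sk_buffs that still reference one leaked page.
import drgn
from drgn import Object, cast
from drgn.helpers.linux.mm import page_to_virt
from drgn.helpers.linux.slab import (find_slab_cache,
                                     slab_cache_for_each_allocated_object)

TARGET = 0xfffffe004359de80  # the leaked struct page from the dump above
head_virt = page_to_virt(Object(prog, "struct page *", TARGET)).value_()

def check_skb(skb):
    # skb->head is a kernel virtual address, not a struct page pointer,
    # so compare page-aligned addresses (may be off for high-order pages)
    if skb.head.value_() & ~0xfff == head_virt:
        print(f"skb {skb.value_():x} uses the page as its head")
    # assumes NET_SKBUFF_DATA_USES_OFFSET (64-bit kernels)
    shinfo = cast("struct skb_shared_info *", skb.head + skb.end)
    for i in range(shinfo.nr_frags):
        frag = shinfo.frags[i]
        try:
            page = frag.bv_page.value_()
        except (AttributeError, LookupError):
            page = frag.netmem.value_() & ~1  # newer trees: netmem_ref
        if page == TARGET:
            print(f"skb {skb.value_():x} frag {i} holds the page")

# clones/fclones live in other caches; extend the list if nothing
# turns up here
cache = find_slab_cache(prog, "skbuff_head_cache")
for skb in slab_cache_for_each_allocated_object(cache, "struct sk_buff"):
    try:
        check_skb(skb)
    except drgn.FaultError:
        continue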
On 2024/8/6 23:16, Jakub Kicinski wrote:
> There appears to be no clean way to hold onto the IOMMU, so page pool
> cannot outlast the driver which created it. We have no way to stall
> the driver unregister, but we can use netdev unregistration as a proxy.
>
> Note that page pool pages may last forever, we have seen it happen
> e.g. when application leaks a socket and page is stuck in its rcv queue.

I am assuming the page will be released when the application dies or
exits, right? Also, I am not sure if the above application is a
privileged one or not. If it is not privileged, perhaps we need to fix
the above problem in the kernel, as it does not seem to make sense for
an unprivileged application to have the kernel leak a page and stall
the unregistering of devices.

> Hopefully this is fine in this particular case, as we will only stall
> unregistering of devices which want the page pool to manage the DMA
> mapping for them, i.e. HW backed netdevs. And obviously keeping
> the netdev around is preferable to a crash.

From internal testing and debugging, it seems there are at least two
cases where the page is not released fast enough for now:

1. IPv4 packet defragmentation timeout: this seems to cause a delay of
   up to 30 secs:
   #define IP_FRAG_TIME (30 * HZ) /* fragment lifetime */

2. skb_defer_free_flush(): this may cause an infinite delay if there is
   no trigger for net_rx_action().

Below is the dump_stack() when the page is returned back to the
page_pool after reloading the driver, which triggers net_rx_action():

[  515.286580] Call trace:
[  515.289012]  dump_backtrace+0x9c/0x100
[  515.292748]  show_stack+0x20/0x38
[  515.296049]  dump_stack_lvl+0x78/0x90
[  515.299699]  dump_stack+0x18/0x28
[  515.303001]  page_pool_put_unrefed_netmem+0x2c4/0x3d0
[  515.308039]  napi_pp_put_page+0xb4/0xe0
[  515.311863]  skb_release_data+0xf8/0x1e0
[  515.315772]  kfree_skb_list_reason+0xb4/0x2a0
[  515.320115]  skb_release_data+0x148/0x1e0
[  515.324111]  napi_consume_skb+0x64/0x190
[  515.328021]  net_rx_action+0x110/0x2a8
[  515.331758]  handle_softirqs+0x120/0x368
[  515.335668]  __do_softirq+0x1c/0x28
[  515.339143]  ____do_softirq+0x18/0x30
[  515.342792]  call_on_irq_stack+0x24/0x58
[  515.346701]  do_softirq_own_stack+0x24/0x38
[  515.350871]  irq_exit_rcu+0x94/0xd0
[  515.354347]  el1_interrupt+0x38/0x68
[  515.357910]  el1h_64_irq_handler+0x18/0x28
[  515.361994]  el1h_64_irq+0x64/0x68
[  515.365382]  default_idle_call+0x34/0x140
[  515.369378]  do_idle+0x20c/0x270
[  515.372593]  cpu_startup_entry+0x40/0x50
[  515.376503]  secondary_start_kernel+0x138/0x160
[  515.381021]  __secondary_switched+0xb8/0xc0

> More work is needed for weird drivers which share one pool among
> multiple netdevs, as they are not allowed to set the pp->netdev
> pointer. We probably need to add a bit that says "don't expose
> to uAPI for them".

Which driver are we talking about here sharing one pool among multiple
netdevs? Is the sharing for memory saving?
There appears to be no clean way to hold onto the IOMMU, so page pool
cannot outlast the driver which created it. We have no way to stall
the driver unregister, but we can use netdev unregistration as a proxy.

Note that page pool pages may last forever, we have seen it happen
e.g. when application leaks a socket and page is stuck in its rcv queue.
Hopefully this is fine in this particular case, as we will only stall
unregistering of devices which want the page pool to manage the DMA
mapping for them, i.e. HW backed netdevs. And obviously keeping
the netdev around is preferable to a crash.

More work is needed for weird drivers which share one pool among
multiple netdevs, as they are not allowed to set the pp->netdev
pointer. We probably need to add a bit that says "don't expose
to uAPI for them".

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
---
Untested, but I think it would work.. if it's not too controversial.

CC: Jesper Dangaard Brouer <hawk@kernel.org>
CC: Alexander Duyck <alexander.duyck@gmail.com>
CC: Yonglong Liu <liuyonglong@huawei.com>
CC: Yunsheng Lin <linyunsheng@huawei.com>
---
 include/linux/netdevice.h |  4 ++++
 net/core/page_pool_user.c | 44 +++++++++++++++++++++++++++++++--------
 2 files changed, 39 insertions(+), 9 deletions(-)
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 0ef3eaa23f4b..c817bde7bacc 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -2342,6 +2342,8 @@ struct net_device {
 	struct lock_class_key	*qdisc_tx_busylock;
 	bool			proto_down;
 	bool			threaded;
+	/** @pp_unreg_pending: page pool code is stalling unregister */
+	bool			pp_unreg_pending;
 
 	struct list_head	net_notifier_list;
 
@@ -2371,6 +2373,8 @@ struct net_device {
 #if IS_ENABLED(CONFIG_PAGE_POOL)
 	/** @page_pools: page pools created for this netdevice */
 	struct hlist_head	page_pools;
+	/** @pp_dev_tracker: ref tracker for page pool code stalling unreg */
+	netdevice_tracker	pp_dev_tracker;
 #endif
 
 	/** @irq_moder: dim parameters used if IS_ENABLED(CONFIG_DIMLIB). */
diff --git a/net/core/page_pool_user.c b/net/core/page_pool_user.c
index 3a3277ba167b..1a4135f01130 100644
--- a/net/core/page_pool_user.c
+++ b/net/core/page_pool_user.c
@@ -349,22 +349,36 @@ static void page_pool_unreg_netdev_wipe(struct net_device *netdev)
 	struct page_pool *pool;
 	struct hlist_node *n;
 
-	mutex_lock(&page_pools_lock);
 	hlist_for_each_entry_safe(pool, n, &netdev->page_pools, user.list) {
 		hlist_del_init(&pool->user.list);
 		pool->slow.netdev = NET_PTR_POISON;
 	}
-	mutex_unlock(&page_pools_lock);
 }
 
-static void page_pool_unreg_netdev(struct net_device *netdev)
+static void page_pool_unreg_netdev_stall(struct net_device *netdev)
+{
+	if (!netdev->pp_unreg_pending) {
+		netdev_hold(netdev, &netdev->pp_dev_tracker, GFP_KERNEL);
+		netdev->pp_unreg_pending = true;
+	} else {
+		netdev_warn(netdev,
+			    "page pool release stalling device unregister");
+	}
+}
+
+static void page_pool_unreg_netdev_unstall(struct net_device *netdev)
+{
+	netdev_put(netdev, &netdev->pp_dev_tracker);
+	netdev->pp_unreg_pending = false;
+}
+
+static void page_pool_unreg_netdev_reparent(struct net_device *netdev)
 {
 	struct page_pool *pool, *last;
 	struct net_device *lo;
 
 	lo = dev_net(netdev)->loopback_dev;
 
-	mutex_lock(&page_pools_lock);
 	last = NULL;
 	hlist_for_each_entry(pool, &netdev->page_pools, user.list) {
 		pool->slow.netdev = lo;
@@ -375,7 +389,6 @@ static void page_pool_unreg_netdev(struct net_device *netdev)
 	if (last)
 		hlist_splice_init(&netdev->page_pools, &last->user.list,
 				  &lo->page_pools);
-	mutex_unlock(&page_pools_lock);
 }
 
 static int
@@ -383,17 +396,30 @@ page_pool_netdevice_event(struct notifier_block *nb,
 			  unsigned long event, void *ptr)
 {
 	struct net_device *netdev = netdev_notifier_info_to_dev(ptr);
+	struct page_pool *pool;
+	bool has_dma;
 
 	if (event != NETDEV_UNREGISTER)
 		return NOTIFY_DONE;
 
-	if (hlist_empty(&netdev->page_pools))
+	if (hlist_empty(&netdev->page_pools) && !netdev->pp_unreg_pending)
 		return NOTIFY_OK;
 
-	if (netdev->ifindex != LOOPBACK_IFINDEX)
-		page_pool_unreg_netdev(netdev);
-	else
+	mutex_lock(&page_pools_lock);
+	has_dma = false;
+	hlist_for_each_entry(pool, &netdev->page_pools, user.list)
+		has_dma |= pool->slow.flags & PP_FLAG_DMA_MAP;
+
+	if (has_dma)
+		page_pool_unreg_netdev_stall(netdev);
+	else if (netdev->pp_unreg_pending)
+		page_pool_unreg_netdev_unstall(netdev);
+	else if (netdev->ifindex == LOOPBACK_IFINDEX)
		page_pool_unreg_netdev_wipe(netdev);
+	else /* driver doesn't let page pools manage DMA addrs */
+		page_pool_unreg_netdev_reparent(netdev);
+	mutex_unlock(&page_pools_lock);
+
	return NOTIFY_OK;
 }