diff mbox series

[1/3] PCI: hv: Fix a memory leak in hv_eject_device_work()

Message ID 20190304213357.16652-2-decui@microsoft.com (mailing list archive)
State Accepted, archived
Commit 05f151a73ec2b23ffbff706e5203e729a995cdc2
Headers show
Series pci-hyperv: fix memory leak and add pci_destroy_slot() | expand

Commit Message

Dexuan Cui March 4, 2019, 9:34 p.m. UTC
After a device is just created in new_pcichild_device(), hpdev->refs is set
to 2 (i.e. the initial value of 1 plus the get_pcichild()).

When we hot remove the device from the host, in Linux VM we first call
hv_pci_eject_device(), which increases hpdev->refs by get_pcichild() and
then schedules a work of hv_eject_device_work(), so hpdev->refs becomes 3
(let's ignore the paired get/put_pcichild() in other places). But in
hv_eject_device_work(), currently we only call put_pcichild() twice,
meaning the 'hpdev' struct can't be freed in put_pcichild(). This patch
adds one put_pcichild() to fix the memory leak.

BTW, the device can also be removed when we run "rmmod pci-hyperv". On this
path (hv_pci_remove() -> hv_pci_bus_exit() -> hv_pci_devices_present()),
hpdev->refs is 2, and we do correctly call put_pcichild() twice in
pci_devices_present_work().

Fixes: 4daace0d8ce8 ("PCI: hv: Add paravirtual PCI front-end for Microsoft Hyper-V VMs")
Signed-off-by: Dexuan Cui <decui@microsoft.com>
Cc: <stable@vger.kernel.org>
---
 drivers/pci/controller/pci-hyperv.c | 3 +++
 1 file changed, 3 insertions(+)

Comments

Michael Kelley (LINUX) March 20, 2019, 9:37 p.m. UTC | #1
From: Dexuan Cui <decui@microsoft.com>
>
> After a device is just created in new_pcichild_device(), hpdev->refs is set
> to 2 (i.e. the initial value of 1 plus the get_pcichild()).
> 
> When we hot remove the device from the host, in Linux VM we first call
> hv_pci_eject_device(), which increases hpdev->refs by get_pcichild() and
> then schedules a work of hv_eject_device_work(), so hpdev->refs becomes 3
> (let's ignore the paired get/put_pcichild() in other places). But in
> hv_eject_device_work(), currently we only call put_pcichild() twice,
> meaning the 'hpdev' struct can't be freed in put_pcichild(). This patch
> adds one put_pcichild() to fix the memory leak.
> 
> BTW, the device can also be removed when we run "rmmod pci-hyperv". On this
> path (hv_pci_remove() -> hv_pci_bus_exit() -> hv_pci_devices_present()),
> hpdev->refs is 2, and we do correctly call put_pcichild() twice in
> pci_devices_present_work().
> 
> Fixes: 4daace0d8ce8 ("PCI: hv: Add paravirtual PCI front-end for Microsoft Hyper-V VMs")
> Signed-off-by: Dexuan Cui <decui@microsoft.com>
> Cc: <stable@vger.kernel.org>

Exiting new_pcichild_device() with hpdev->refs set to 2 seems OK to me.
There is the reference in the hbus->children list, and there is the reference that
is returned to the caller.  But what is strange is that pci_devices_present_work()
overwrites the reference returned in local variable hpdev without doing a
put_pcichild().  It seems like the "normal" reference count should be 1 when the
child device is not being manipulated, not 2.  The fix would be to add a call to
put_pcichild() when the return value from new_pcichild_device() is overwritten.
Then remove the call to put_pcichild() in pci_device_present_work() when missing
children are moved to the local list. The children have been moved from one list
to another, so there's no need to decrement the reference count.  Then when
everything in the local list is deleted, the reference is correctly decremented,
presumably freeing the memory.

With this approach, the code in hv_eject_device_work() is correct.  There's
one call to put_pcichild() to reflect removing the child device from the hbus->
children list, and one call to put_pcichild() to pair with the get_pcichild() in
hv_pci_eject_device().

Your patch works, but to me it leaves the ref count in an unnatural state
most of the time.

Michael
Dexuan Cui March 21, 2019, 12:12 a.m. UTC | #2
> From: Michael Kelley <mikelley@microsoft.com>
> Sent: Wednesday, March 20, 2019 2:38 PM
> 
> From: Dexuan Cui <decui@microsoft.com>
> >
> > After a device is just created in new_pcichild_device(), hpdev->refs is set
> > to 2 (i.e. the initial value of 1 plus the get_pcichild()).
> >
> > When we hot remove the device from the host, in Linux VM we first call
> > hv_pci_eject_device(), which increases hpdev->refs by get_pcichild() and
> > then schedules a work of hv_eject_device_work(), so hpdev->refs becomes 3
> > (let's ignore the paired get/put_pcichild() in other places). But in
> > hv_eject_device_work(), currently we only call put_pcichild() twice,
> > meaning the 'hpdev' struct can't be freed in put_pcichild(). This patch
> > adds one put_pcichild() to fix the memory leak.
> >
> > BTW, the device can also be removed when we run "rmmod pci-hyperv". On
> this
> > path (hv_pci_remove() -> hv_pci_bus_exit() -> hv_pci_devices_present()),
> > hpdev->refs is 2, and we do correctly call put_pcichild() twice in
> > pci_devices_present_work().
> 
> Exiting new_pcichild_device() with hpdev->refs set to 2 seems OK to me.
> There is the reference in the hbus->children list, and there is the reference that
> is returned to the caller.  
So IMO the "normal" reference count should be 2. :-) IMO only when a hv_pci_dev
device is about to be destroyed, its reference count can drop to less than 2, 
i.e. first temporarily drop to 1 (meaning the hv_pci_dev device is removed from
hbus->children), and then drop to zero (meaning kfree(hpdev) is called).

> But what is strange is that pci_devices_present_work()
> overwrites the reference returned in local variable hpdev without doing a
> put_pcichild().
I suppose you mean:

        /* First, mark all existing children as reported missing. */
        spin_lock_irqsave(&hbus->device_list_lock, flags);
        list_for_each_entry(hpdev, &hbus->children, list_entry) {
                hpdev->reported_missing = true;
        }
        spin_unlock_irqrestore(&hbus->device_list_lock, flags)

This is not strange to me, because, in pci_devices_present_work(), at first we
don't know which devices are about to disappear, so we pre-mark all devices to
be potentially missing like that; if a device is still on the bus, we'll mark its
hpdev->reported_missing to false later; only after we know exactly which
devices are missing, we should call put_pcichild() against them. All these
seem natural to me.

> It seems like the "normal" reference count should be 1 when the
> child device is not being manipulated, not 2.
What does "not being manipulated" mean?

> The fix would be to add a call to
> put_pcichild() when the return value from new_pcichild_device() is
> overwritten.
In pci_devices_present_work(), we NEVER "overwrite" the "hpdev" returned
from new_pcichild_device(): the "reported_missing" field of the new hpdev
is implicitly initialized to false in new_pcichild_device().

> Then remove the call to put_pcichild() in pci_device_present_work() when
> missing
> children are moved to the local list. The children have been moved from one
> list
> to another, so there's no need to decrement the reference count.  Then when
> everything in the local list is deleted, the reference is correctly decremented,
> presumably freeing the memory.
> 
> With this approach, the code in hv_eject_device_work() is correct.  There's
> one call to put_pcichild() to reflect removing the child device from the hbus->
> children list, and one call to put_pcichild() to pair with the get_pcichild() in
> hv_pci_eject_device().
Please refer to my replies above. IMO we should fix
hv_eject_device_work() rather than pci_devices_present_work().

Thanks
-- Dexuan
 
> Your patch works, but to me it leaves the ref count in an unnatural state
> most of the time.
> 
> Michael
Lorenzo Pieralisi March 26, 2019, 5:08 p.m. UTC | #3
On Thu, Mar 21, 2019 at 12:12:03AM +0000, Dexuan Cui wrote:
> > From: Michael Kelley <mikelley@microsoft.com>
> > Sent: Wednesday, March 20, 2019 2:38 PM
> > 
> > From: Dexuan Cui <decui@microsoft.com>
> > >
> > > After a device is just created in new_pcichild_device(), hpdev->refs is set
> > > to 2 (i.e. the initial value of 1 plus the get_pcichild()).
> > >
> > > When we hot remove the device from the host, in Linux VM we first call
> > > hv_pci_eject_device(), which increases hpdev->refs by get_pcichild() and
> > > then schedules a work of hv_eject_device_work(), so hpdev->refs becomes 3
> > > (let's ignore the paired get/put_pcichild() in other places). But in
> > > hv_eject_device_work(), currently we only call put_pcichild() twice,
> > > meaning the 'hpdev' struct can't be freed in put_pcichild(). This patch
> > > adds one put_pcichild() to fix the memory leak.
> > >
> > > BTW, the device can also be removed when we run "rmmod pci-hyperv". On
> > this
> > > path (hv_pci_remove() -> hv_pci_bus_exit() -> hv_pci_devices_present()),
> > > hpdev->refs is 2, and we do correctly call put_pcichild() twice in
> > > pci_devices_present_work().
> > 
> > Exiting new_pcichild_device() with hpdev->refs set to 2 seems OK to me.
> > There is the reference in the hbus->children list, and there is the reference that
> > is returned to the caller.  
> So IMO the "normal" reference count should be 2. :-) IMO only when a hv_pci_dev
> device is about to be destroyed, its reference count can drop to less than 2, 
> i.e. first temporarily drop to 1 (meaning the hv_pci_dev device is removed from
> hbus->children), and then drop to zero (meaning kfree(hpdev) is called).
> 
> > But what is strange is that pci_devices_present_work()
> > overwrites the reference returned in local variable hpdev without doing a
> > put_pcichild().
> I suppose you mean:
> 
>         /* First, mark all existing children as reported missing. */
>         spin_lock_irqsave(&hbus->device_list_lock, flags);
>         list_for_each_entry(hpdev, &hbus->children, list_entry) {
>                 hpdev->reported_missing = true;
>         }
>         spin_unlock_irqrestore(&hbus->device_list_lock, flags)
> 
> This is not strange to me, because, in pci_devices_present_work(), at first we
> don't know which devices are about to disappear, so we pre-mark all devices to
> be potentially missing like that; if a device is still on the bus, we'll mark its
> hpdev->reported_missing to false later; only after we know exactly which
> devices are missing, we should call put_pcichild() against them. All these
> seem natural to me.
> 
> > It seems like the "normal" reference count should be 1 when the
> > child device is not being manipulated, not 2.
> What does "not being manipulated" mean?
> 
> > The fix would be to add a call to
> > put_pcichild() when the return value from new_pcichild_device() is
> > overwritten.
> In pci_devices_present_work(), we NEVER "overwrite" the "hpdev" returned
> from new_pcichild_device(): the "reported_missing" field of the new hpdev
> is implicitly initialized to false in new_pcichild_device().
> 
> > Then remove the call to put_pcichild() in pci_device_present_work() when
> > missing
> > children are moved to the local list. The children have been moved from one
> > list
> > to another, so there's no need to decrement the reference count.  Then when
> > everything in the local list is deleted, the reference is correctly decremented,
> > presumably freeing the memory.
> > 
> > With this approach, the code in hv_eject_device_work() is correct.  There's
> > one call to put_pcichild() to reflect removing the child device from the hbus->
> > children list, and one call to put_pcichild() to pair with the get_pcichild() in
> > hv_pci_eject_device().
> Please refer to my replies above. IMO we should fix
> hv_eject_device_work() rather than pci_devices_present_work().

Have we reached a conclusion on this ? I would like to merge this series
given that it is fixing bugs and it has hung in the balance for quite
a while but it looks like Michael is not too happy about these patches
and I need a maintainer ACK to merge them.

Thanks,
Lorenzo

> Thanks
> -- Dexuan
>  
> > Your patch works, but to me it leaves the ref count in an unnatural state
> > most of the time.
> > 
> > Michael
>
Michael Kelley (LINUX) March 26, 2019, 5:47 p.m. UTC | #4
From: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>  Sent: Tuesday, March 26, 2019 10:09 AM
> On Thu, Mar 21, 2019 at 12:12:03AM +0000, Dexuan Cui wrote:
> > > From: Michael Kelley <mikelley@microsoft.com>
> > > Sent: Wednesday, March 20, 2019 2:38 PM
> > >
> > > From: Dexuan Cui <decui@microsoft.com>
> > > >
> > > > After a device is just created in new_pcichild_device(), hpdev->refs is set
> > > > to 2 (i.e. the initial value of 1 plus the get_pcichild()).
> > > >
> > > > When we hot remove the device from the host, in Linux VM we first call
> > > > hv_pci_eject_device(), which increases hpdev->refs by get_pcichild() and
> > > > then schedules a work of hv_eject_device_work(), so hpdev->refs becomes 3
> > > > (let's ignore the paired get/put_pcichild() in other places). But in
> > > > hv_eject_device_work(), currently we only call put_pcichild() twice,
> > > > meaning the 'hpdev' struct can't be freed in put_pcichild(). This patch
> > > > adds one put_pcichild() to fix the memory leak.
> > > >
> > > > BTW, the device can also be removed when we run "rmmod pci-hyperv". On
> > > this
> > > > path (hv_pci_remove() -> hv_pci_bus_exit() -> hv_pci_devices_present()),
> > > > hpdev->refs is 2, and we do correctly call put_pcichild() twice in
> > > > pci_devices_present_work().
> > >
> > > Exiting new_pcichild_device() with hpdev->refs set to 2 seems OK to me.
> > > There is the reference in the hbus->children list, and there is the reference that
> > > is returned to the caller.
> > So IMO the "normal" reference count should be 2. :-) IMO only when a hv_pci_dev
> > device is about to be destroyed, its reference count can drop to less than 2,
> > i.e. first temporarily drop to 1 (meaning the hv_pci_dev device is removed from
> > hbus->children), and then drop to zero (meaning kfree(hpdev) is called).
> >
> > > But what is strange is that pci_devices_present_work()
> > > overwrites the reference returned in local variable hpdev without doing a
> > > put_pcichild().
> > I suppose you mean:
> >
> >         /* First, mark all existing children as reported missing. */
> >         spin_lock_irqsave(&hbus->device_list_lock, flags);
> >         list_for_each_entry(hpdev, &hbus->children, list_entry) {
> >                 hpdev->reported_missing = true;
> >         }
> >         spin_unlock_irqrestore(&hbus->device_list_lock, flags)
> >
> > This is not strange to me, because, in pci_devices_present_work(), at first we
> > don't know which devices are about to disappear, so we pre-mark all devices to
> > be potentially missing like that; if a device is still on the bus, we'll mark its
> > hpdev->reported_missing to false later; only after we know exactly which
> > devices are missing, we should call put_pcichild() against them. All these
> > seem natural to me.
> >
> > > It seems like the "normal" reference count should be 1 when the
> > > child device is not being manipulated, not 2.
> > What does "not being manipulated" mean?
> >
> > > The fix would be to add a call to
> > > put_pcichild() when the return value from new_pcichild_device() is
> > > overwritten.
> > In pci_devices_present_work(), we NEVER "overwrite" the "hpdev" returned
> > from new_pcichild_device(): the "reported_missing" field of the new hpdev
> > is implicitly initialized to false in new_pcichild_device().
> >
> > > Then remove the call to put_pcichild() in pci_device_present_work() when
> > > missing
> > > children are moved to the local list. The children have been moved from one
> > > list
> > > to another, so there's no need to decrement the reference count.  Then when
> > > everything in the local list is deleted, the reference is correctly decremented,
> > > presumably freeing the memory.
> > >
> > > With this approach, the code in hv_eject_device_work() is correct.  There's
> > > one call to put_pcichild() to reflect removing the child device from the hbus->
> > > children list, and one call to put_pcichild() to pair with the get_pcichild() in
> > > hv_pci_eject_device().
> > Please refer to my replies above. IMO we should fix
> > hv_eject_device_work() rather than pci_devices_present_work().
> 
> Have we reached a conclusion on this ? I would like to merge this series
> given that it is fixing bugs and it has hung in the balance for quite
> a while but it looks like Michael is not too happy about these patches
> and I need a maintainer ACK to merge them.
> 
> Thanks,
> Lorenzo

Dexuan and I have discussed the topic extensively offline.  The patch works
in its current form, and I'll agree to it.

Reviewed-by:  Michael Kelley <mikelley@microsoft.com>
Dexuan Cui March 26, 2019, 6:01 p.m. UTC | #5
> From: Michael Kelley <mikelley@microsoft.com>
> Sent: Tuesday, March 26, 2019 10:47 AM
> To: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>; Dexuan Cui
> <decui@microsoft.com>
> Cc: bhelgaas@google.com; linux-pci@vger.kernel.org; KY Srinivasan
> <kys@microsoft.com>; Stephen Hemminger <sthemmin@microsoft.com>;
> Sasha Levin <Alexander.Levin@microsoft.com>; linux-hyperv@vger.kernel.org;
> linux-kernel@vger.kernel.org; driverdev-devel@linuxdriverproject.org; Haiyang
> Zhang <haiyangz@microsoft.com>; olaf@aepfle.de; apw@canonical.com;
> jasowang@redhat.com; vkuznets <vkuznets@redhat.com>;
> marcelo.cerri@canonical.com; jackm@mellanox.com; stable@vger.kernel.org
> Subject: RE: [PATCH 1/3] PCI: hv: Fix a memory leak in hv_eject_device_work()
> 
> From: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>  Sent: Tuesday, March 26,
> 2019 10:09 AM
> > On Thu, Mar 21, 2019 at 12:12:03AM +0000, Dexuan Cui wrote:
> > > > From: Michael Kelley <mikelley@microsoft.com>
> > > > Sent: Wednesday, March 20, 2019 2:38 PM
> > > >
> > > > From: Dexuan Cui <decui@microsoft.com>
> > > > >
> > > > > After a device is just created in new_pcichild_device(), hpdev->refs is
> set
> > > > > to 2 (i.e. the initial value of 1 plus the get_pcichild()).
> > > > >
> > > > > When we hot remove the device from the host, in Linux VM we first call
> > > > > hv_pci_eject_device(), which increases hpdev->refs by get_pcichild()
> and
> > > > > then schedules a work of hv_eject_device_work(), so hpdev->refs
> becomes 3
> > > > > (let's ignore the paired get/put_pcichild() in other places). But in
> > > > > hv_eject_device_work(), currently we only call put_pcichild() twice,
> > > > > meaning the 'hpdev' struct can't be freed in put_pcichild(). This patch
> > > > > adds one put_pcichild() to fix the memory leak.
> > > > >
> > > > > BTW, the device can also be removed when we run "rmmod pci-hyperv".
> On
> > > > this
> > > > > path (hv_pci_remove() -> hv_pci_bus_exit() ->
> hv_pci_devices_present()),
> > > > > hpdev->refs is 2, and we do correctly call put_pcichild() twice in
> > > > > pci_devices_present_work().
> > > >
> > > > Exiting new_pcichild_device() with hpdev->refs set to 2 seems OK to me.
> > > > There is the reference in the hbus->children list, and there is the
> reference that
> > > > is returned to the caller.
> > > So IMO the "normal" reference count should be 2. :-) IMO only when a
> hv_pci_dev
> > > device is about to be destroyed, its reference count can drop to less than 2,
> > > i.e. first temporarily drop to 1 (meaning the hv_pci_dev device is removed
> from
> > > hbus->children), and then drop to zero (meaning kfree(hpdev) is called).
> > >
> > > > But what is strange is that pci_devices_present_work()
> > > > overwrites the reference returned in local variable hpdev without doing a
> > > > put_pcichild().
> > > I suppose you mean:
> > >
> > >         /* First, mark all existing children as reported missing. */
> > >         spin_lock_irqsave(&hbus->device_list_lock, flags);
> > >         list_for_each_entry(hpdev, &hbus->children, list_entry) {
> > >                 hpdev->reported_missing = true;
> > >         }
> > >         spin_unlock_irqrestore(&hbus->device_list_lock, flags)
> > >
> > > This is not strange to me, because, in pci_devices_present_work(), at first
> we
> > > don't know which devices are about to disappear, so we pre-mark all
> devices to
> > > be potentially missing like that; if a device is still on the bus, we'll mark its
> > > hpdev->reported_missing to false later; only after we know exactly which
> > > devices are missing, we should call put_pcichild() against them. All these
> > > seem natural to me.
> > >
> > > > It seems like the "normal" reference count should be 1 when the
> > > > child device is not being manipulated, not 2.
> > > What does "not being manipulated" mean?
> > >
> > > > The fix would be to add a call to
> > > > put_pcichild() when the return value from new_pcichild_device() is
> > > > overwritten.
> > > In pci_devices_present_work(), we NEVER "overwrite" the "hpdev"
> returned
> > > from new_pcichild_device(): the "reported_missing" field of the new hpdev
> > > is implicitly initialized to false in new_pcichild_device().
> > >
> > > > Then remove the call to put_pcichild() in pci_device_present_work()
> when
> > > > missing
> > > > children are moved to the local list. The children have been moved from
> one
> > > > list
> > > > to another, so there's no need to decrement the reference count.  Then
> when
> > > > everything in the local list is deleted, the reference is correctly
> decremented,
> > > > presumably freeing the memory.
> > > >
> > > > With this approach, the code in hv_eject_device_work() is correct.
> There's
> > > > one call to put_pcichild() to reflect removing the child device from the
> hbus->
> > > > children list, and one call to put_pcichild() to pair with the get_pcichild() in
> > > > hv_pci_eject_device().
> > > Please refer to my replies above. IMO we should fix
> > > hv_eject_device_work() rather than pci_devices_present_work().
> >
> > Have we reached a conclusion on this ? I would like to merge this series
> > given that it is fixing bugs and it has hung in the balance for quite
> > a while but it looks like Michael is not too happy about these patches
> > and I need a maintainer ACK to merge them.
> >
> > Thanks,
> > Lorenzo
> 
> Dexuan and I have discussed the topic extensively offline.  The patch works
> in its current form, and I'll agree to it.
> 
> Reviewed-by:  Michael Kelley <mikelley@microsoft.com>

Thanks, Michael!

Hi Lorenzo,
All the 3 patches have got Michael's Reviewed-by.

Previously, Stephen Hemminger, one of the Hyper-V driver maintainers, 
provided his Reviewed-by in the " [PATCH 0/3]" mail:
https://lkml.org/lkml/2019/3/5/521

Thanks,
--Dexuan
Lorenzo Pieralisi March 26, 2019, 6:12 p.m. UTC | #6
On Tue, Mar 26, 2019 at 06:01:32PM +0000, Dexuan Cui wrote:

[...]

> > > Have we reached a conclusion on this ? I would like to merge this series
> > > given that it is fixing bugs and it has hung in the balance for quite
> > > a while but it looks like Michael is not too happy about these patches
> > > and I need a maintainer ACK to merge them.
> > >
> > > Thanks,
> > > Lorenzo
> > 
> > Dexuan and I have discussed the topic extensively offline.  The patch works
> > in its current form, and I'll agree to it.
> > 
> > Reviewed-by:  Michael Kelley <mikelley@microsoft.com>
> 
> Thanks, Michael!
> 
> Hi Lorenzo,
> All the 3 patches have got Michael's Reviewed-by.

Good, I asked because I do not want to merge patches with review
questions open, thanks for the clarifications.

> Previously, Stephen Hemminger, one of the Hyper-V driver maintainers, 
> provided his Reviewed-by in the " [PATCH 0/3]" mail:
> https://lkml.org/lkml/2019/3/5/521

I will reformat/rewrite the logs and notify you when queued.

Thanks,
Lorenzo
diff mbox series

Patch

diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
index 95441a35eceb..30f16b882746 100644
--- a/drivers/pci/controller/pci-hyperv.c
+++ b/drivers/pci/controller/pci-hyperv.c
@@ -1900,6 +1900,9 @@  static void hv_eject_device_work(struct work_struct *work)
 			 sizeof(*ejct_pkt), (unsigned long)&ctxt.pkt,
 			 VM_PKT_DATA_INBAND, 0);
 
+	/* For the get_pcichild() in hv_pci_eject_device() */
+	put_pcichild(hpdev);
+	/* For the two refs got in new_pcichild_device() */
 	put_pcichild(hpdev);
 	put_pcichild(hpdev);
 	put_hvpcibus(hpdev->hbus);