Message ID | 1619070346-21557-2-git-send-email-longli@linuxonhyperv.com (mailing list archive) |
---|---|
State | Superseded |
Series | [v2,1/2] PCI: hv: Fix a race condition when removing the device |

> From: longli@linuxonhyperv.com <longli@linuxonhyperv.com>
> Sent: Wednesday, April 21, 2021 10:46 PM
>
> With the new method of flushing/stopping the workqueue before doing bus
> removal, the old mechanisum of using refcount and wait for completion

mechanisum -> mechanism

> is no longer needed. Remove those dead code.
>
> Signed-off-by: Long Li <longli@microsoft.com>
> ---

The patch looks good to me. BTW, can we also remove get_pcichild() and
put_pcichild() in an extra patch? I suspect we don't really need those either.

> Subject: RE: [Patch v2 2/2] PCI: hv: Remove unused refcount and supporting
> functions for handling bus device removal
>
> > From: longli@linuxonhyperv.com <longli@linuxonhyperv.com>
> > Sent: Wednesday, April 21, 2021 10:46 PM
> >
> > With the new method of flushing/stopping the workqueue before doing
> > bus removal, the old mechanisum of using refcount and wait for
> > completion
>
> mechanisum -> mechanism
>
> > is no longer needed. Remove those dead code.
> >
> > Signed-off-by: Long Li <longli@microsoft.com>
> > ---
>
> The patch looks good to me. BTW, can we also remove get_pcichild() and
> put_pcichild() in an extra patch? I suspect we don't really need those either.

Those two functions protect access to the devices on the hbus. There are
interactions from the PCI layer that need a guarantee from the hbus that the
device is present at the time of access.

Why do you think we don't need those?

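The get/put pattern under discussion can be illustrated with a small userspace model. This is only a sketch, not the driver's code: `struct child_dev`, `child_new()`, `child_get()`, `child_put()` and the `child_frees` counter are hypothetical names standing in for `struct hv_pci_dev`, `get_pcichild()` and `put_pcichild()`, and C11 atomics stand in for the kernel's `refcount_t`.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct hv_pci_dev; only the refcount matters. */
struct child_dev {
	atomic_int refs;
	int slot;
};

/* Counts frees so a caller can observe when the object really went away. */
static int child_frees;

/* Allocate a child holding one initial reference (the "bus list" reference). */
static struct child_dev *child_new(int slot)
{
	struct child_dev *c = malloc(sizeof(*c));

	if (!c)
		return NULL;
	atomic_init(&c->refs, 1);
	c->slot = slot;
	return c;
}

/* Analogue of get_pcichild(): the caller must already hold a valid reference. */
static void child_get(struct child_dev *c)
{
	int old = atomic_fetch_add(&c->refs, 1);

	assert(old > 0);
	(void)old;
}

/*
 * Analogue of put_pcichild(): drop one reference and free the object when
 * the last one goes away. Returns true if this call freed the object.
 */
static bool child_put(struct child_dev *c)
{
	if (atomic_fetch_sub(&c->refs, 1) == 1) {
		free(c);
		child_frees++;
		return true;
	}
	return false;
}
```

In this model a code path that takes `child_get()` before using the object keeps it alive even if the owner drops its own reference in the meantime; the object is freed only by whichever `child_put()` runs last.
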
> From: Long Li <longli@microsoft.com>
> Sent: Friday, April 23, 2021 11:40 AM
> To: Dexuan Cui <decui@microsoft.com>; longli@linuxonhyperv.com; KY
>
> > Subject: RE: [Patch v2 2/2] PCI: hv: Remove unused refcount and supporting
> > functions for handling bus device removal
> >
> > > From: longli@linuxonhyperv.com <longli@linuxonhyperv.com>
> > > Sent: Wednesday, April 21, 2021 10:46 PM
> > >
> > > With the new method of flushing/stopping the workqueue before doing
> > > bus removal, the old mechanisum of using refcount and wait for
> > > completion
> >
> > mechanisum -> mechanism
> >
> > > is no longer needed. Remove those dead code.
> > >
> > > Signed-off-by: Long Li <longli@microsoft.com>
> >
> > The patch looks good to me. BTW, can we also remove get_pcichild() and
> > put_pcichild() in an extra patch? I suspect we don't really need those either.
>
> Those two functions are for protecting accessing to the devices on the hbus.
> There are interactions from PCI layer that need guarantee from hbus that the
> device is present at the time of access.
>
> Why do you think we don't' need those?

IMO there is proper locking and synchronization logic in the PCI layer, so I
don't think the 'hpdev' struct can vanish while it's being accessed from the
PCI layer.

I think the 'hpdev' struct can only be freed in two scenarios:

1) the PCI device is removed: the VM receives the PCI_EJECT message,
hv_eject_device_work() calls pci_stop_and_remove_bus_device() to deregister
the pci_dev from the PCI layer, then does other cleanup, and finally calls
kfree(hpdev) in the third put_pcichild() in hv_eject_device_work().

2) the pci-hyperv driver is unloaded: in this case, hv_pci_remove() calls
pci_remove_root_bus() to deregister the pci_dev, and then calls
hv_pci_bus_exit() -> hv_pci_start_relations_work(), and eventually
pci_devices_present_work() decreases the ref counter to zero and frees the
'hpdev'.

In both cases, while the hpdev or the pdev is still being used by the PCI
layer, I think pci_stop_and_remove_bus_device() and pci_remove_root_bus()
should be blocked, i.e. the hpdev can't be freed even if we don't have the
ref counter mechanism.

For example, we know the 'lspci' program can read the PCI device's config
space directly via the sysfs file /sys/bus/pci/devices/XXXX/config; when
'lspci' is reading the config space, the code path can be:

    do_syscall_64
    ksys_pread64
    vfs_read
    new_sync_read
    kernfs_fop_read_iter
    kernfs_file_read_iter
    kernfs_get_active -> atomic_inc_unless_negative(&kn->active).
    pci_read_config
    pci_user_read_config_dword
    hv_pcifront_read_config
    _hv_pcifront_read_config

At this moment, if the host tries to remove the PCI device, the PCI_EJECT
code path will be blocked due to the kn->active:

    hv_eject_device_work()
    pci_stop_and_remove_bus_device()
    pci_stop_bus_device
    pci_remove_sysfs_dev_files
    kernfs_remove_by_name_ns
    __kernfs_remove
    kernfs_drain
    wait_event(root->deactivate_waitq,
               atomic_read(&kn->active) == KN_DEACTIVATED_BIAS);

I haven't checked all the scenarios and code paths, but generally speaking I
suppose the PCI subsystem should already have the proper locking and
synchronization logic we need here.

PS, I'm not sure if the host can remove the device by only sending a
PCI_BUS_RELATIONS message with bus_rel->device_count == 0 (i.e. no PCI_EJECT
message): in this case, if we don't have the ref counter for hpdev,
pci_devices_present_work() frees the hpdev before calling
pci_scan_child_bus(), and this can be a potential race condition. But IMO
this can be easily fixed by moving "free the hpdev" to a later place, after
pci_scan_child_bus() is called.

Thanks,
-- Dexuan

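The blocking behaviour described in the email above (a reader holds kn->active, so kernfs_drain() has to wait) can be modelled in userspace. This is a rough single-threaded sketch, not kernfs itself: `struct node` and the `node_*()` helpers are hypothetical names, `DEACTIVATED_BIAS` is assumed to mirror the kernel's `KN_DEACTIVATED_BIAS`, and a boolean return value stands in for the wake-up that `wait_event()` would receive.

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Assumed to mirror KN_DEACTIVATED_BIAS; treated as an assumption here. */
#define DEACTIVATED_BIAS (INT_MIN + 1)

/* Hypothetical stand-in for a kernfs node; only the active counter matters. */
struct node {
	atomic_int active;
};

static void node_init(struct node *n)
{
	atomic_init(&n->active, 0);
}

/* Reader entry: increment unless the counter is negative (node deactivated). */
static bool node_get_active(struct node *n)
{
	int v = atomic_load(&n->active);

	while (v >= 0) {
		/* On failure, v is reloaded with the current value. */
		if (atomic_compare_exchange_weak(&n->active, &v, v + 1))
			return true;
	}
	return false;
}

/* Reader exit: returns true if this was the last reader a drainer waits for. */
static bool node_put_active(struct node *n)
{
	return atomic_fetch_sub(&n->active, 1) - 1 == DEACTIVATED_BIAS;
}

/* Removal side: push the counter negative so that new readers are refused. */
static void node_deactivate(struct node *n)
{
	assert(atomic_load(&n->active) >= 0);	/* deactivate only once */
	atomic_fetch_add(&n->active, DEACTIVATED_BIAS);
}

/* The condition the removal side would sleep on until it becomes true. */
static bool node_drained(struct node *n)
{
	return atomic_load(&n->active) == DEACTIVATED_BIAS;
}
```

With one reader inside, `node_deactivate()` leaves the counter at `DEACTIVATED_BIAS + 1`, so `node_drained()` stays false until that reader calls `node_put_active()`; that corresponds to the removal path being held up while the config space read is in flight.
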
diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
index fc948a2ed703..e6b4ee323068 100644
--- a/drivers/pci/controller/pci-hyperv.c
+++ b/drivers/pci/controller/pci-hyperv.c
@@ -452,7 +452,6 @@ struct hv_pcibus_device {
 	/* Protocol version negotiated with the host */
 	enum pci_protocol_version_t protocol_version;
 	enum hv_pcibus_state state;
-	refcount_t remove_lock;
 	struct hv_device *hdev;
 	resource_size_t low_mmio_space;
 	resource_size_t high_mmio_space;
@@ -460,7 +459,6 @@ struct hv_pcibus_device {
 	struct resource *low_mmio_res;
 	struct resource *high_mmio_res;
 	struct completion *survey_event;
-	struct completion remove_event;
 	struct pci_bus *pci_bus;
 	spinlock_t config_lock;	/* Avoid two threads writing index page */
 	spinlock_t device_list_lock;	/* Protect lists below */
@@ -593,9 +591,6 @@ static void put_pcichild(struct hv_pci_dev *hpdev)
 		kfree(hpdev);
 }
 
-static void get_hvpcibus(struct hv_pcibus_device *hv_pcibus);
-static void put_hvpcibus(struct hv_pcibus_device *hv_pcibus);
-
 /*
  * There is no good way to get notified from vmbus_onoffer_rescind(),
  * so let's use polling here, since this is not a hot path.
@@ -2067,10 +2062,8 @@ static void pci_devices_present_work(struct work_struct *work)
 	}
 	spin_unlock_irqrestore(&hbus->device_list_lock, flags);
 
-	if (!dr) {
-		put_hvpcibus(hbus);
+	if (!dr)
 		return;
-	}
 
 	/* First, mark all existing children as reported missing. */
 	spin_lock_irqsave(&hbus->device_list_lock, flags);
@@ -2153,7 +2146,6 @@ static void pci_devices_present_work(struct work_struct *work)
 		break;
 	}
 
-	put_hvpcibus(hbus);
 	kfree(dr);
 }
 
@@ -2194,12 +2186,10 @@ static int hv_pci_start_relations_work(struct hv_pcibus_device *hbus,
 	list_add_tail(&dr->list_entry, &hbus->dr_list);
 	spin_unlock_irqrestore(&hbus->device_list_lock, flags);
 
-	if (pending_dr) {
+	if (pending_dr)
 		kfree(dr_wrk);
-	} else {
-		get_hvpcibus(hbus);
+	else
 		queue_work(hbus->wq, &dr_wrk->wrk);
-	}
 
 	return 0;
 }
@@ -2342,8 +2332,6 @@ static void hv_eject_device_work(struct work_struct *work)
 	put_pcichild(hpdev);
 	put_pcichild(hpdev);
 	/* hpdev has been freed. Do not use it any more. */
-
-	put_hvpcibus(hbus);
 }
 
 /**
@@ -2367,7 +2355,6 @@ static void hv_pci_eject_device(struct hv_pci_dev *hpdev)
 	hpdev->state = hv_pcichild_ejecting;
 	get_pcichild(hpdev);
 	INIT_WORK(&hpdev->wrk, hv_eject_device_work);
-	get_hvpcibus(hbus);
 	queue_work(hbus->wq, &hpdev->wrk);
 }
 
@@ -2967,17 +2954,6 @@ static int hv_send_resources_released(struct hv_device *hdev)
 	return 0;
 }
 
-static void get_hvpcibus(struct hv_pcibus_device *hbus)
-{
-	refcount_inc(&hbus->remove_lock);
-}
-
-static void put_hvpcibus(struct hv_pcibus_device *hbus)
-{
-	if (refcount_dec_and_test(&hbus->remove_lock))
-		complete(&hbus->remove_event);
-}
-
 #define HVPCI_DOM_MAP_SIZE (64 * 1024)
 static DECLARE_BITMAP(hvpci_dom_map, HVPCI_DOM_MAP_SIZE);
 
@@ -3097,14 +3073,12 @@ static int hv_pci_probe(struct hv_device *hdev,
 	hbus->sysdata.domain = dom;
 
 	hbus->hdev = hdev;
-	refcount_set(&hbus->remove_lock, 1);
 	INIT_LIST_HEAD(&hbus->children);
 	INIT_LIST_HEAD(&hbus->dr_list);
 	INIT_LIST_HEAD(&hbus->resources_for_children);
 	spin_lock_init(&hbus->config_lock);
 	spin_lock_init(&hbus->device_list_lock);
 	spin_lock_init(&hbus->retarget_msi_interrupt_lock);
-	init_completion(&hbus->remove_event);
 	hbus->wq = alloc_ordered_workqueue("hv_pci_%x", 0,
 					   hbus->sysdata.domain);
 	if (!hbus->wq) {
@@ -3332,8 +3306,6 @@ static int hv_pci_remove(struct hv_device *hdev)
 	hv_pci_free_bridge_windows(hbus);
 	irq_domain_remove(hbus->irq_domain);
 	irq_domain_free_fwnode(hbus->sysdata.fwnode);
-	put_hvpcibus(hbus);
-	wait_for_completion(&hbus->remove_event);
 
 	hv_put_dom_num(hbus->sysdata.domain);
 