Message ID: 20231220005153.3984502-3-haifeng.zhao@linux.intel.com
State: Superseded
Series: fix vt-d hard lockup when hotplug ATS capable device
On Tue, Dec 19, 2023 at 07:51:53PM -0500, Ethan Zhao wrote:
> For those endpoint devices connect to system via hotplug capable ports,
> users could request a warm reset to the device by flapping device's link
> through setting the slot's link control register, as pciehpt_ist() DLLSC
> interrupt sequence response, pciehp will unload the device driver and
> then power it off. thus cause an IOMMU devTLB flush request for device to
> be sent and a long time completion/timeout waiting in interrupt context.

I think the problem is in the "waiting in interrupt context".

Can you change qi_submit_sync() to *sleep* until the queue is done?
Instead of busy-waiting in atomic context?

Is the hardware capable of sending an interrupt once the queue is done?
If it is not capable, would it be viable to poll with exponential backoff
and sleep in-between polling once the polling delay increases beyond,
say, 10 usec?

Again, the proposed patch is not a proper solution. It will paper over
the issue most of the time, but every once in a while someone will still
get a hard lockup splat, and it will then be more difficult to reproduce
and fix if the proposed patch is accepted.

> [ 4223.822622] CPU: 144 PID: 1422 Comm: irq/57-pciehp Kdump: loaded Tainted: G S
> OE kernel version xxxx

I don't see any reason to hide the kernel version.
This isn't Intel Confidential information.

> [ 4223.822628] Call Trace:
> [ 4223.822628] qi_flush_dev_iotlb+0xb1/0xd0
> [ 4223.822628] __dmar_remove_one_dev_info+0x224/0x250
> [ 4223.822629] dmar_remove_one_dev_info+0x3e/0x50

__dmar_remove_one_dev_info() was removed by db75c9573b08 in v6.0,
one and a half years ago, so the stack trace appears to be from an
older kernel version.

Thanks,

Lukas
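Lukas's exponential-backoff idea could be sketched roughly as follows. This is an illustrative user-space sketch, not actual qi_submit_sync() code: `poll_with_backoff`, the predicate callbacks, and the constants are all hypothetical stand-ins.

```c
#include <stdbool.h>
#include <time.h>

/*
 * Illustrative sketch only -- not the real qi_submit_sync() loop.
 * Poll a completion predicate, doubling the delay each round; once
 * the delay exceeds 10 usec, sleep between polls instead of spinning
 * so the CPU is not held for the whole wait.
 *
 * Returns 0 on completion, -1 if timeout_us elapses first.
 */
static int poll_with_backoff(bool (*done)(void *), void *ctx,
                             long timeout_us)
{
    long delay_us = 1;      /* initial polling interval */
    long waited_us = 0;

    while (!done(ctx)) {
        if (waited_us >= timeout_us)
            return -1;      /* hard cutoff instead of a lockup */

        if (delay_us > 10) {
            /* past 10 usec: yield the CPU while waiting */
            struct timespec ts = { 0, delay_us * 1000 };
            nanosleep(&ts, NULL);
        }
        /* below 10 usec we would busy-wait (cpu_relax() in-kernel) */

        waited_us += delay_us;
        delay_us *= 2;      /* exponential backoff */
    }
    return 0;
}

/* trivial predicates for demonstration */
static bool dev_done(void *ctx)  { (void)ctx; return true; }
static bool dev_stuck(void *ctx) { (void)ctx; return false; }
```

The key property is that the total wait is bounded by `timeout_us`, and the number of wakeups grows only logarithmically with the wait time.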
On Thu, Dec 21, 2023 at 11:39:40AM +0100, Lukas Wunner wrote:
> On Tue, Dec 19, 2023 at 07:51:53PM -0500, Ethan Zhao wrote:
> > For those endpoint devices connect to system via hotplug capable ports,
> > users could request a warm reset to the device by flapping device's link
> > through setting the slot's link control register, as pciehpt_ist() DLLSC
> > interrupt sequence response, pciehp will unload the device driver and
> > then power it off. thus cause an IOMMU devTLB flush request for device to
> > be sent and a long time completion/timeout waiting in interrupt context.
>
> I think the problem is in the "waiting in interrupt context".

I'm wondering whether Intel IOMMUs possibly have a (perhaps undocumented)
capability to reduce the Invalidate Completion Timeout to a sane value?
Could you check whether that's supported?

Granted, the Implementation Note you've pointed to allows 1 sec + 50%,
but that's not even a "must", it's a "should". So devices are free to
take even longer. We have to cut off at *some* point.

Thanks,

Lukas
On 12/21/2023 6:39 PM, Lukas Wunner wrote:
> On Tue, Dec 19, 2023 at 07:51:53PM -0500, Ethan Zhao wrote:
>> For those endpoint devices connect to system via hotplug capable ports,
>> users could request a warm reset to the device by flapping device's link
>> through setting the slot's link control register, as pciehpt_ist() DLLSC
>> interrupt sequence response, pciehp will unload the device driver and
>> then power it off. thus cause an IOMMU devTLB flush request for device to
>> be sent and a long time completion/timeout waiting in interrupt context.
>
> I think the problem is in the "waiting in interrupt context".
>
> Can you change qi_submit_sync() to *sleep* until the queue is done?
> Instead of busy-waiting in atomic context?

If you read that function carefully, you wouldn't say "sleep" there...
it is 'sync'ed.

> Is the hardware capable of sending an interrupt once the queue is done?
> If it is not capable, would it be viable to poll with exponential backoff
> and sleep in-between polling once the polling delay increases beyond, say,
> 10 usec?

I don't know whether polling and sleeping for the completion of a
meaningless devTLB invalidation request blindly sent to a
(removed/powered-down/link-down) device makes sense or not. But according
to PCIe spec 6.1, 10.3.1: "Software ensures no invalidations are issued
to a Function when its ATS capability is disabled."

> Again, the proposed patch is not a proper solution. It will paper over
> the issue most of the time but every once in a while someone will still
> get a hard lockup splat and it will then be more difficult to reproduce
> and fix if the proposed patch is accepted.

Could you point out why it is not proper? Is there any other window in
which the hard lockup could still happen in the ATS-capable device
surprise-removal case if we check the connection state first? Please
help to elaborate.

>> [ 4223.822622] CPU: 144 PID: 1422 Comm: irq/57-pciehp Kdump: loaded Tainted: G S
>> OE kernel version xxxx
>
> I don't see any reason to hide the kernel version.
> This isn't Intel Confidential information.

Yes, this is an old kernel stack trace, but the customer also tried the
latest 6.7-rc4 (doesn't work) and the patched 6.7-rc4 (fixed).

Thanks,
Ethan

>> [ 4223.822628] Call Trace:
>> [ 4223.822628] qi_flush_dev_iotlb+0xb1/0xd0
>> [ 4223.822628] __dmar_remove_one_dev_info+0x224/0x250
>> [ 4223.822629] dmar_remove_one_dev_info+0x3e/0x50
>
> __dmar_remove_one_dev_info() was removed by db75c9573b08 in v6.0
> one and a half years ago, so the stack trace appears to be from
> an older kernel version.
>
> Thanks,
>
> Lukas
On 12/21/2023 7:01 PM, Lukas Wunner wrote:
> On Thu, Dec 21, 2023 at 11:39:40AM +0100, Lukas Wunner wrote:
>> On Tue, Dec 19, 2023 at 07:51:53PM -0500, Ethan Zhao wrote:
>>> For those endpoint devices connect to system via hotplug capable ports,
>>> users could request a warm reset to the device by flapping device's link
>>> through setting the slot's link control register, as pciehpt_ist() DLLSC
>>> interrupt sequence response, pciehp will unload the device driver and
>>> then power it off. thus cause an IOMMU devTLB flush request for device to
>>> be sent and a long time completion/timeout waiting in interrupt context.
>>
>> I think the problem is in the "waiting in interrupt context".
>
> I'm wondering whether Intel IOMMUs possibly have a (perhaps undocumented)
> capability to reduce the Invalidate Completion Timeout to a sane value?
> Could you check whether that's supported?

It is not about Intel VT-d's capability, per my understanding; it is the
third-party PCIe switch's capability. Such switches are not aware of ATS
transactions at all: if the endpoint device below a downstream port is
removed/powered-off/link-down, the switch cannot feed back a
fault/completion/timeout to the upstream IOMMU for the broken ATS
transaction. The root port, however, can (verified).

> Granted, the Implementation Note you've pointed to allows 1 sec + 50%,

1 min (60 sec) + 50%

> but that's not even a "must", it's a "should". So devices are free to
> take even longer. We have to cut off at *some* point.

That could happen if we blindly wait here, so we should avoid such a case.

Thanks,
Ethan
On 12/21/2023 7:01 PM, Lukas Wunner wrote:
> On Thu, Dec 21, 2023 at 11:39:40AM +0100, Lukas Wunner wrote:
>> On Tue, Dec 19, 2023 at 07:51:53PM -0500, Ethan Zhao wrote:
>>> For those endpoint devices connect to system via hotplug capable ports,
>>> users could request a warm reset to the device by flapping device's link
>>> through setting the slot's link control register, as pciehpt_ist() DLLSC
>>> interrupt sequence response, pciehp will unload the device driver and
>>> then power it off. thus cause an IOMMU devTLB flush request for device to
>>> be sent and a long time completion/timeout waiting in interrupt context.
>>
>> I think the problem is in the "waiting in interrupt context".
>
> I'm wondering whether Intel IOMMUs possibly have a (perhaps undocumented)
> capability to reduce the Invalidate Completion Timeout to a sane value?
> Could you check whether that's supported?
>
> Granted, the Implementation Note you've pointed to allows 1 sec + 50%,
> but that's not even a "must", it's a "should". So devices are free to
> take even longer. We have to cut off at *some* point.

I really "expected" there to be an interrupt signal to the IOMMU hardware
when a device below a PCIe switch downstream port is gone, or some
internal polling/heartbeating of the endpoint device to detect the ATS
breakage, but so far it seems there are only hotplug interrupts to the
downstream port control...

How do we define the "some" msec point at which software times out and
breaks out of the waiting loop? Or should we poll whether the target is
gone?

Thanks,
Ethan
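The two options Ethan weighs here, a hard timeout cutoff versus re-checking whether the target is gone, can be combined in one wait loop. The following is an illustrative user-space sketch under assumed names: `wait_for_invalidation`, `struct wait_ctx`, `device_present`, and `ITE_TIMEOUT_US` are hypothetical stand-ins, not real VT-d driver interfaces.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Illustrative sketch only: a bounded invalidation wait that re-checks
 * on every iteration whether the target device is still reachable, so a
 * surprise-removed device cuts the wait short instead of consuming the
 * full timeout. ITE_TIMEOUT_US is an arbitrary cutoff for the sketch.
 */
#define ITE_TIMEOUT_US 10000

struct wait_ctx {
    bool (*completed)(void *);       /* has the hw posted completion?  */
    bool (*device_present)(void *);  /* is the target still connected? */
    void *hw;
};

/* 0 = completed, -1 = device gone, -2 = timed out */
static int wait_for_invalidation(struct wait_ctx *w)
{
    for (long spent = 0; spent < ITE_TIMEOUT_US; spent++) {
        if (w->completed(w->hw))
            return 0;
        if (!w->device_present(w->hw))
            return -1;  /* target gone: no completion will ever arrive */
        /* in-kernel this iteration would be ~1 usec of cpu_relax() */
    }
    return -2;          /* cutoff reached: report an error, not a lockup */
}

/* trivial predicates for demonstration */
static bool hw_yes(void *p) { (void)p; return true; }
static bool hw_no(void *p)  { (void)p; return false; }
```

The design point is that the error paths differ: a vanished device is detected in one iteration, while the timeout only triggers for a present but unresponsive device.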
diff --git a/drivers/iommu/intel/pasid.c b/drivers/iommu/intel/pasid.c
index 74e8e4c17e81..7dbee9931eb6 100644
--- a/drivers/iommu/intel/pasid.c
+++ b/drivers/iommu/intel/pasid.c
@@ -481,6 +481,9 @@ devtlb_invalidation_with_pasid(struct intel_iommu *iommu,
 	if (!info || !info->ats_enabled)
 		return;
 
+	if (pci_dev_is_disconnected(to_pci_dev(dev)))
+		return;
+
 	sid = info->bus << 8 | info->devfn;
 	qdep = info->ats_qdep;
 	pfsid = info->pfsid;