
[2/2] iommu/vt-d: don't issue devTLB flush request when device is disconnected

Message ID 20231213034637.2603013-3-haifeng.zhao@linux.intel.com (mailing list archive)
State Superseded
Headers show
Series fix vt-d hard lockup when hotplug ATS capable device

Commit Message

Ethan Zhao Dec. 13, 2023, 3:46 a.m. UTC
For endpoint devices connected to the system via hotplug capable ports,
users could request a warm reset of the device by flapping the device's
link through the slot's link control register. As part of the pciehp_ist()
DLLSC interrupt sequence response, pciehp will unload the device driver and
then power the device off, which causes an IOMMU devTLB flush request for
the device to be sent and a long completion/timeout wait in interrupt
context.

That results in the following continuous hard lockup warning and system
hang:

[ 4211.433662] pcieport 0000:17:01.0: pciehp: Slot(108): Link Down
[ 4211.433664] pcieport 0000:17:01.0: pciehp: Slot(108): Card not present
[ 4223.822591] NMI watchdog: Watchdog detected hard LOCKUP on cpu 144
[ 4223.822622] CPU: 144 PID: 1422 Comm: irq/57-pciehp Kdump: loaded Tainted: G S
         OE    kernel version xxxx
[ 4223.822623] Hardware name: vendorname xxxx 666-106,
BIOS 01.01.02.03.01 05/15/2023
[ 4223.822623] RIP: 0010:qi_submit_sync+0x2c0/0x490
[ 4223.822624] Code: 48 be 00 00 00 00 00 08 00 00 49 85 74 24 20 0f 95 c1 48 8b
 57 10 83 c1 04 83 3c 1a 03 0f 84 a2 01 00 00 49 8b 04 24 8b 70 34 <40> f6 c6 1
0 74 17 49 8b 04 24 8b 80 80 00 00 00 89 c2 d3 fa 41 39
[ 4223.822624] RSP: 0018:ffffc4f074f0bbb8 EFLAGS: 00000093
[ 4223.822625] RAX: ffffc4f040059000 RBX: 0000000000000014 RCX: 0000000000000005
[ 4223.822625] RDX: ffff9f3841315800 RSI: 0000000000000000 RDI: ffff9f38401a8340
[ 4223.822625] RBP: ffff9f38401a8340 R08: ffffc4f074f0bc00 R09: 0000000000000000
[ 4223.822626] R10: 0000000000000010 R11: 0000000000000018 R12: ffff9f384005e200
[ 4223.822626] R13: 0000000000000004 R14: 0000000000000046 R15: 0000000000000004
[ 4223.822626] FS:  0000000000000000(0000) GS:ffffa237ae400000(0000)
knlGS:0000000000000000
[ 4223.822627] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 4223.822627] CR2: 00007ffe86515d80 CR3: 000002fd3000a001 CR4: 0000000000770ee0
[ 4223.822627] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 4223.822628] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
[ 4223.822628] PKRU: 55555554
[ 4223.822628] Call Trace:
[ 4223.822628]  qi_flush_dev_iotlb+0xb1/0xd0
[ 4223.822628]  __dmar_remove_one_dev_info+0x224/0x250
[ 4223.822629]  dmar_remove_one_dev_info+0x3e/0x50
[ 4223.822629]  intel_iommu_release_device+0x1f/0x30
[ 4223.822629]  iommu_release_device+0x33/0x60
[ 4223.822629]  iommu_bus_notifier+0x7f/0x90
[ 4223.822630]  blocking_notifier_call_chain+0x60/0x90
[ 4223.822630]  device_del+0x2e5/0x420
[ 4223.822630]  pci_remove_bus_device+0x70/0x110
[ 4223.822630]  pciehp_unconfigure_device+0x7c/0x130
[ 4223.822631]  pciehp_disable_slot+0x6b/0x100
[ 4223.822631]  pciehp_handle_presence_or_link_change+0xd8/0x320
[ 4223.822631]  pciehp_ist+0x176/0x180
[ 4223.822631]  ? irq_finalize_oneshot.part.50+0x110/0x110
[ 4223.822632]  irq_thread_fn+0x19/0x50
[ 4223.822632]  irq_thread+0x104/0x190
[ 4223.822632]  ? irq_forced_thread_fn+0x90/0x90
[ 4223.822632]  ? irq_thread_check_affinity+0xe0/0xe0
[ 4223.822633]  kthread+0x114/0x130
[ 4223.822633]  ? __kthread_cancel_work+0x40/0x40
[ 4223.822633]  ret_from_fork+0x1f/0x30
[ 4223.822633] Kernel panic - not syncing: Hard LOCKUP
[ 4223.822634] CPU: 144 PID: 1422 Comm: irq/57-pciehp Kdump: loaded Tainted: G S
         OE     kernel version xxxx
[ 4223.822634] Hardware name: vendorname xxxx 666-106,
BIOS 01.01.02.03.01 05/15/2023
[ 4223.822634] Call Trace:
[ 4223.822634]  <NMI>
[ 4223.822635]  dump_stack+0x6d/0x88
[ 4223.822635]  panic+0x101/0x2d0
[ 4223.822635]  ? ret_from_fork+0x11/0x30
[ 4223.822635]  nmi_panic.cold.14+0xc/0xc
[ 4223.822636]  watchdog_overflow_callback.cold.8+0x6d/0x81
[ 4223.822636]  __perf_event_overflow+0x4f/0xf0
[ 4223.822636]  handle_pmi_common+0x1ef/0x290
[ 4223.822636]  ? __set_pte_vaddr+0x28/0x40
[ 4223.822637]  ? flush_tlb_one_kernel+0xa/0x20
[ 4223.822637]  ? __native_set_fixmap+0x24/0x30
[ 4223.822637]  ? ghes_copy_tofrom_phys+0x70/0x100
[ 4223.822637]  ? __ghes_peek_estatus.isra.16+0x49/0xa0
[ 4223.822637]  intel_pmu_handle_irq+0xba/0x2b0
[ 4223.822638]  perf_event_nmi_handler+0x24/0x40
[ 4223.822638]  nmi_handle+0x4d/0xf0
[ 4223.822638]  default_do_nmi+0x49/0x100
[ 4223.822638]  exc_nmi+0x134/0x180
[ 4223.822639]  end_repeat_nmi+0x16/0x67
[ 4223.822639] RIP: 0010:qi_submit_sync+0x2c0/0x490
[ 4223.822639] Code: 48 be 00 00 00 00 00 08 00 00 49 85 74 24 20 0f 95 c1 48 8b
 57 10 83 c1 04 83 3c 1a 03 0f 84 a2 01 00 00 49 8b 04 24 8b 70 34 <40> f6 c6 10
 74 17 49 8b 04 24 8b 80 80 00 00 00 89 c2 d3 fa 41 39
[ 4223.822640] RSP: 0018:ffffc4f074f0bbb8 EFLAGS: 00000093
[ 4223.822640] RAX: ffffc4f040059000 RBX: 0000000000000014 RCX: 0000000000000005
[ 4223.822640] RDX: ffff9f3841315800 RSI: 0000000000000000 RDI: ffff9f38401a8340
[ 4223.822641] RBP: ffff9f38401a8340 R08: ffffc4f074f0bc00 R09: 0000000000000000
[ 4223.822641] R10: 0000000000000010 R11: 0000000000000018 R12: ffff9f384005e200
[ 4223.822641] R13: 0000000000000004 R14: 0000000000000046 R15: 0000000000000004
[ 4223.822641]  ? qi_submit_sync+0x2c0/0x490
[ 4223.822642]  ? qi_submit_sync+0x2c0/0x490
[ 4223.822642]  </NMI>
[ 4223.822642]  qi_flush_dev_iotlb+0xb1/0xd0
[ 4223.822642]  __dmar_remove_one_dev_info+0x224/0x250
[ 4223.822643]  dmar_remove_one_dev_info+0x3e/0x50
[ 4223.822643]  intel_iommu_release_device+0x1f/0x30
[ 4223.822643]  iommu_release_device+0x33/0x60
[ 4223.822643]  iommu_bus_notifier+0x7f/0x90
[ 4223.822644]  blocking_notifier_call_chain+0x60/0x90
[ 4223.822644]  device_del+0x2e5/0x420
[ 4223.822644]  pci_remove_bus_device+0x70/0x110
[ 4223.822644]  pciehp_unconfigure_device+0x7c/0x130
[ 4223.822644]  pciehp_disable_slot+0x6b/0x100
[ 4223.822645]  pciehp_handle_presence_or_link_change+0xd8/0x320
[ 4223.822645]  pciehp_ist+0x176/0x180
[ 4223.822645]  ? irq_finalize_oneshot.part.50+0x110/0x110
[ 4223.822645]  irq_thread_fn+0x19/0x50
[ 4223.822646]  irq_thread+0x104/0x190
[ 4223.822646]  ? irq_forced_thread_fn+0x90/0x90
[ 4223.822646]  ? irq_thread_check_affinity+0xe0/0xe0
[ 4223.822646]  kthread+0x114/0x130
[ 4223.822647]  ? __kthread_cancel_work+0x40/0x40
[ 4223.822647]  ret_from_fork+0x1f/0x30
[ 4223.822647] Kernel Offset: 0x6400000 from 0xffffffff81000000 (relocation
range: 0xffffffff80000000-0xffffffffbfffffff)

Fix it by checking the device's error_state in
devtlb_invalidation_with_pasid() to avoid sending a meaningless devTLB
flush request to a link-down device whose error_state has been set to
pci_channel_io_perm_failure and which is then powered off in

pciehp_ist()
   pciehp_handle_presence_or_link_change()
     pciehp_disable_slot()
       remove_board()
         pciehp_unconfigure_device()
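
For reference, the pci_dev_is_disconnected() helper used below is a thin
wrapper over exactly this error_state (paraphrased from drivers/pci/pci.h;
the exact form may vary between kernel versions):

  static inline bool pci_dev_is_disconnected(const struct pci_dev *dev)
  {
          return dev->error_state == pci_channel_io_perm_failure;
  }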

Tested-by: Haorong Ye <yehaorong@bytedance.com>
Signed-off-by: Ethan Zhao <haifeng.zhao@linux.intel.com>
---
 drivers/iommu/intel/pasid.c | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

Comments

Lukas Wunner Dec. 13, 2023, 10:44 a.m. UTC | #1
On Tue, Dec 12, 2023 at 10:46:37PM -0500, Ethan Zhao wrote:
> For those endpoint devices connect to system via hotplug capable ports,
> users could request a warm reset to the device by flapping device's link
> through setting the slot's link control register,

Well, users could just *unplug* the device, right?  Why is it relevant
that they could fiddle with registers in config space?


> as pciehpt_ist() DLLSC
> interrupt sequence response, pciehp will unload the device driver and
> then power it off. thus cause an IOMMU devTLB flush request for device to
> be sent and a long time completion/timeout waiting in interrupt context.

A completion timeout should be on the order of usecs or msecs, why does it
cause a hard lockup?  The dmesg excerpt you've provided shows a 12 *second*
delay between hot removal and watchdog reaction.


> Fix it by checking the device's error_state in
> devtlb_invalidation_with_pasid() to avoid sending meaningless devTLB flush
> request to link down device that is set to pci_channel_io_perm_failure and
> then powered off in

This doesn't seem to be a proper fix.  It will work most of the time
but not always.  A user might bring down the slot via sysfs, then yank
the card from the slot just when the iommu flush occurs such that the
pci_dev_is_disconnected(pdev) check returns false but the card is
physically gone immediately afterwards.  In other words, you've shrunk
the time window during which the issue may occur, but haven't eliminated
it completely.

Thanks,

Lukas
Robin Murphy Dec. 13, 2023, 11:54 a.m. UTC | #2
On 13/12/2023 10:44 am, Lukas Wunner wrote:
> On Tue, Dec 12, 2023 at 10:46:37PM -0500, Ethan Zhao wrote:
>> For those endpoint devices connect to system via hotplug capable ports,
>> users could request a warm reset to the device by flapping device's link
>> through setting the slot's link control register,
> 
> Well, users could just *unplug* the device, right?  Why is it relevant
> that thay could fiddle with registers in config space?
> 
> 
>> as pciehpt_ist() DLLSC
>> interrupt sequence response, pciehp will unload the device driver and
>> then power it off. thus cause an IOMMU devTLB flush request for device to
>> be sent and a long time completion/timeout waiting in interrupt context.
> 
> A completion timeout should be on the order of usecs or msecs, why does it
> cause a hard lockup?  The dmesg excerpt you've provided shows a 12 *second*
> delay between hot removal and watchdog reaction.

The PCIe spec only requires an endpoint to respond to an ATS invalidate 
within a rather hilarious 90 seconds, so it's primarily a question of 
how patient the root complex and bridges in between are prepared to be.

>> Fix it by checking the device's error_state in
>> devtlb_invalidation_with_pasid() to avoid sending meaningless devTLB flush
>> request to link down device that is set to pci_channel_io_perm_failure and
>> then powered off in
> 
> This doesn't seem to be a proper fix.  It will work most of the time
> but not always.  A user might bring down the slot via sysfs, then yank
> the card from the slot just when the iommu flush occurs such that the
> pci_dev_is_disconnected(pdev) check returns false but the card is
> physically gone immediately afterwards.  In other words, you've shrunk
> the time window during which the issue may occur, but haven't eliminated
> it completely.

Yeah, I think we have a subtle but fundamental issue here in that the 
iommu_release_device() callback is hooked to BUS_NOTIFY_REMOVED_DEVICE, 
so in general probably shouldn't be assuming it's safe to do anything 
with the device itself *after* it's already been removed from its bus - 
this step is primarily about cleaning up any of the IOMMU's own state 
relating to the given device.

I think if we want to ensure ATCs are invalidated on hot-unplug we need 
an additional pre-removal notifier to take care of that, and that step 
would then want to distinguish between an orderly removal where cleaning 
up is somewhat meaningful, and a surprise removal where it definitely isn't.

Thanks,
Robin.
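
A minimal sketch of the pre-removal hook Robin describes, assuming a bus
notifier at the iommu core level and assuming pci_dev_is_disconnected()
were reachable from there (at the time of this posting it lives in the PCI
core's private drivers/pci/pci.h); the callback name and placement are
hypothetical and not part of this series:

  #include <linux/device.h>
  #include <linux/notifier.h>
  #include <linux/pci.h>

  static int iommu_hotunplug_notifier(struct notifier_block *nb,
                                      unsigned long action, void *data)
  {
          struct device *dev = data;

          switch (action) {
          case BUS_NOTIFY_DEL_DEVICE:
                  /* Orderly removal: the device is still on the bus, so an
                   * ATC/devTLB invalidation can meaningfully be issued. */
                  if (!dev_is_pci(dev) ||
                      pci_dev_is_disconnected(to_pci_dev(dev)))
                          break;  /* surprise removal: device already gone */
                  /* issue the ATC/devTLB invalidation for 'dev' here */
                  break;
          case BUS_NOTIFY_REMOVED_DEVICE:
                  /* Too late to talk to the device; clean up only the
                   * IOMMU's own state here. */
                  break;
          }
          return NOTIFY_DONE;
  }
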
Baolu Lu Dec. 13, 2023, 11:59 a.m. UTC | #3
On 2023/12/13 11:46, Ethan Zhao wrote:
> For those endpoint devices connect to system via hotplug capable ports,
> users could request a warm reset to the device by flapping device's link
> through setting the slot's link control register, as pciehpt_ist() DLLSC
> interrupt sequence response, pciehp will unload the device driver and
> then power it off.

Is it possible for pciehp to disable ATS on the device before unloading
the driver? Or should the device follow some specific steps to warm
reset the device?

What happens if IOMMU issues device TLB invalidation after link down but
before pci_dev_is_disconnected() returns true?

Best regards,
baolu
Ethan Zhao Dec. 14, 2023, 2:16 a.m. UTC | #4
On 12/13/2023 6:44 PM, Lukas Wunner wrote:
> On Tue, Dec 12, 2023 at 10:46:37PM -0500, Ethan Zhao wrote:
>> For those endpoint devices connect to system via hotplug capable ports,
>> users could request a warm reset to the device by flapping device's link
>> through setting the slot's link control register,
> Well, users could just *unplug* the device, right?  Why is it relevant
> that thay could fiddle with registers in config space?
>
Yes, if the device and its slot are hotplug capable, users could just
'unplug' the device.

But in the case reported here, users tried to do a warm reset with a tool
command like:

  mlxfwreset -d <busid> -y reset

which actually accesses configuration space just as

  setpci -s 0000:17:01.0 0x78.L=0x21050010

does. Well, we couldn't tell users not to fiddle with PCIe config space
registers like that.

>> as pciehpt_ist() DLLSC
>> interrupt sequence response, pciehp will unload the device driver and
>> then power it off. thus cause an IOMMU devTLB flush request for device to
>> be sent and a long time completion/timeout waiting in interrupt context.
> A completion timeout should be on the order of usecs or msecs, why does it
> cause a hard lockup?  The dmesg excerpt you've provided shows a 12 *second*
> delay between hot removal and watchdog reaction.
>
In my understanding, the devTLB flush request sent to an ATS capable
device is a non-posted request; if the ATS transaction is broken by an
endpoint link-down or power-off event, the timeout can take up to 90
seconds (1 minute +50%/-0%). See the "Invalidate Completion Timeout"
implementation note in chapter 10.3.1 Invalidate Request of PCIe spec 6.1:

"IMPLEMENTATION NOTE: INVALIDATE COMPLETION TIMEOUT

Devices should respond to Invalidate Requests within 1 minute (+50% -0%).
Having a bounded time permits an ATPT to implement Invalidate Completion
Timeouts and reuse the associated ITag values. ATPT designs are
implementation specific. As such, Invalidate Completion Timeouts and their
associated error handling are outside the scope of this specification."

>> Fix it by checking the device's error_state in
>> devtlb_invalidation_with_pasid() to avoid sending meaningless devTLB flush
>> request to link down device that is set to pci_channel_io_perm_failure and
>> then powered off in
> This doesn't seem to be a proper fix.  It will work most of the time
> but not always.  A user might bring down the slot via sysfs, then yank
> the card from the slot just when the iommu flush occurs such that the
> pci_dev_is_disconnected(pdev) check returns false but the card is
> physically gone immediately afterwards.  In other words, you've shrunk
> the time window during which the issue may occur, but haven't eliminated
> it completely.

If you mean disabling the slot via sysfs, that's SAFE_REMOVAL, right?
That would issue the devTLB invalidation first and power off the device
later, so it wouldn't trigger the hard lockup even though
pci_dev_is_disconnected() returns false. This fix works for such a case.


Thanks,

Ethan



>
> Thanks,
>
> Lukas
Ethan Zhao Dec. 14, 2023, 2:26 a.m. UTC | #5
On 12/13/2023 7:59 PM, Baolu Lu wrote:
> On 2023/12/13 11:46, Ethan Zhao wrote:
>> For those endpoint devices connect to system via hotplug capable ports,
>> users could request a warm reset to the device by flapping device's link
>> through setting the slot's link control register, as pciehpt_ist() DLLSC
>> interrupt sequence response, pciehp will unload the device driver and
>> then power it off.
>
> Is it possible for pciehp to disable ATS on the device before unloading
> the driver? Or should the device follow some specific steps to warm
> reset the device?
>
In this case, the link goes down first and then pciehp_ist() gets the
DLLSC interrupt to learn about it. I don't think it makes sense to disable
the device's ATS here, but pciehp could flag the device as ATS disabled;
well, "disconnected" is enough to let vt-d like software know the device
state.


> What happens if IOMMU issues device TLB invalidation after link down but
> before pci_dev_is_disconnected() returns true?

It seems that wouldn't happen in the hotplug cases, either safe removal
or surprise removal.



Thanks,

Ethan

>
> Best regards,
> baolu
Ethan Zhao Dec. 14, 2023, 2:40 a.m. UTC | #6
On 12/13/2023 7:54 PM, Robin Murphy wrote:
> On 13/12/2023 10:44 am, Lukas Wunner wrote:
>> On Tue, Dec 12, 2023 at 10:46:37PM -0500, Ethan Zhao wrote:
>>> For those endpoint devices connect to system via hotplug capable ports,
>>> users could request a warm reset to the device by flapping device's 
>>> link
>>> through setting the slot's link control register,
>>
>> Well, users could just *unplug* the device, right?  Why is it relevant
>> that thay could fiddle with registers in config space?
>>
>>
>>> as pciehpt_ist() DLLSC
>>> interrupt sequence response, pciehp will unload the device driver and
>>> then power it off. thus cause an IOMMU devTLB flush request for 
>>> device to
>>> be sent and a long time completion/timeout waiting in interrupt 
>>> context.
>>
>> A completion timeout should be on the order of usecs or msecs, why 
>> does it
>> cause a hard lockup?  The dmesg excerpt you've provided shows a 12 
>> *second*
>> delay between hot removal and watchdog reaction.
>
> The PCIe spec only requires an endpoint to respond to an ATS 
> invalidate within a rather hilarious 90 seconds, so it's primarily a 
> question of how patient the root complex and bridges in between are 
> prepared to be.

The reported issue only reproduces with an endpoint device connected to
the system via a PCIe switch (which only has a read tracking feature);
those switches seem not to be aware of ATS transactions, while the root
port is aware of ATS when the ATS transaction is broken (invalidation
sent, but link down, device removed, etc.). But I didn't find any public
spec about that.

>
>>> Fix it by checking the device's error_state in
>>> devtlb_invalidation_with_pasid() to avoid sending meaningless devTLB 
>>> flush
>>> request to link down device that is set to 
>>> pci_channel_io_perm_failure and
>>> then powered off in
>>
>> This doesn't seem to be a proper fix.  It will work most of the time
>> but not always.  A user might bring down the slot via sysfs, then yank
>> the card from the slot just when the iommu flush occurs such that the
>> pci_dev_is_disconnected(pdev) check returns false but the card is
>> physically gone immediately afterwards.  In other words, you've shrunk
>> the time window during which the issue may occur, but haven't eliminated
>> it completely.
>
> Yeah, I think we have a subtle but fundamental issue here in that the 
> iommu_release_device() callback is hooked to 
> BUS_NOTIFY_REMOVED_DEVICE, so in general probably shouldn't be 
> assuming it's safe to do anything with the device itself *after* it's 
> already been removed from its bus - this step is primarily about 
> cleaning up any of the IOMMU's own state relating to the given device.
>
> I think if we want to ensure ATCs are invalidated on hot-unplug we 
> need an additional pre-removal notifier to take care of that, and that 
> step would then want to distinguish between an orderly removal where 
> cleaning up is somewhat meaningful, and a surprise removal where it 
> definitely isn't.

So, at least, we should check the device state before issuing the devTLB invalidation.


Thanks,

Ethan

>
>
> Thanks,
> Robin.
Ethan Zhao Dec. 15, 2023, 12:43 a.m. UTC | #7
On 12/14/2023 10:16 AM, Ethan Zhao wrote:
>
> On 12/13/2023 6:44 PM, Lukas Wunner wrote:
>> On Tue, Dec 12, 2023 at 10:46:37PM -0500, Ethan Zhao wrote:
>>> For those endpoint devices connect to system via hotplug capable ports,
>>> users could request a warm reset to the device by flapping device's 
>>> link
>>> through setting the slot's link control register,
>> Well, users could just *unplug* the device, right?  Why is it relevant
>> that thay could fiddle with registers in config space?
>>
> Yes, if the device and it's slot are hotplug capable, users could just
>
> 'unplug' the device.
>
> But this case reported, users try to do a warm reset with a tool
>
> command like:
>
>   mlxfwreset -d <busid> -y reset
>
> Actually, it will access configuration space  just as
>
>  setpci -s 0000:17:01.0 0x78.L=0x21050010
>
> Well, we couldn't say don't fiddle PCIe config space registers like
>
> that.
>
>>> as pciehpt_ist() DLLSC
>>> interrupt sequence response, pciehp will unload the device driver and
>>> then power it off. thus cause an IOMMU devTLB flush request for 
>>> device to
>>> be sent and a long time completion/timeout waiting in interrupt 
>>> context.
>> A completion timeout should be on the order of usecs or msecs, why 
>> does it
>> cause a hard lockup?  The dmesg excerpt you've provided shows a 12 
>> *second*
>> delay between hot removal and watchdog reaction.
>>
> In my understanding, the devTLB flush request sent to ATS capable devcie
>
> is non-posted request, if the ATS transaction is broken by endpoint link
>
> -down, power-off event, the timeout will take up to 60 seconds+-30,
>
> see "Invalidate Completion Timeout " part of
>
> chapter 10.3.1 Invalidate Request
>
> In PCIe spec 6.1
>
> "
>
> IMPLEMENTATION NOTE:
>
> INVALIDATE COMPLETION TIMEOUT
>
> Devices should respond to Invalidate Requests within 1 minute (+50% 
> -0%).Having a bounded time
>
> permits an ATPT to implement Invalidate Completion Timeouts and reuse 
> the associated ITag values.
>
> ATPT designs are implementation specific. As such, Invalidate 
> Completion Timeouts and their
>
> associated error handling are outside the scope of this specification
>
> "
>
>>> Fix it by checking the device's error_state in
>>> devtlb_invalidation_with_pasid() to avoid sending meaningless devTLB 
>>> flush
>>> request to link down device that is set to 
>>> pci_channel_io_perm_failure and
>>> then powered off in
>> This doesn't seem to be a proper fix.  It will work most of the time
>> but not always.  A user might bring down the slot via sysfs, then yank
>> the card from the slot just when the iommu flush occurs such that the
>> pci_dev_is_disconnected(pdev) check returns false but the card is
>> physically gone immediately afterwards.  In other words, you've shrunk
>> the time window during which the issue may occur, but haven't eliminated
>> it completely.
>
> If you mean disable the slot via sysfs, that's SAFE_REMOVAL, right ?
>
> that would issse devTLB invalidation first, power off device later, it
>
> wouldn't trigger the hard lockup, though the
>
> pci_dev_is_disconnected() return false. this fix works such case.

Could you help to point out whether there are any other windows to close?

Thanks,

Ethan


>
>
> Thanks,
>
> Ethan
>
>
>
>>
>> Thanks,
>>
>> Lukas
Ethan Zhao Dec. 15, 2023, 1:03 a.m. UTC | #8
On 12/14/2023 10:26 AM, Ethan Zhao wrote:
>
> On 12/13/2023 7:59 PM, Baolu Lu wrote:
>> On 2023/12/13 11:46, Ethan Zhao wrote:
>>> For those endpoint devices connect to system via hotplug capable ports,
>>> users could request a warm reset to the device by flapping device's 
>>> link
>>> through setting the slot's link control register, as pciehpt_ist() 
>>> DLLSC
>>> interrupt sequence response, pciehp will unload the device driver and
>>> then power it off.
>>
>> Is it possible for pciehp to disable ATS on the device before unloading
>> the driver? Or should the device follow some specific steps to warm
>> reset the device?
>>
> In this case, link down first, then pciehp_ist() got DLLSC interrupt 
> to know
>
> that, I don't think it makes sense to disable device ATS here, but it 
> could
>
> flag the device is ATS disabled, well,  "disconnected" is enough to let
>
> vt-d like software knows the device state.
>
>
For hot "unplug" cases:

1. safe_removal

   Users request unplug the device via sysfs or press the attention button,

   Then pciehp will response to unconfig device/unload device driver, power

   if off, and devcie is ready to remove. in this case, devTLB invalidate

   request is sent before device link to be brought down or device power

   to be turned off. so it doesn't trigger the hard lockup.

2. supprise_removal

  Users remove the devece directly or bring the device link down/turn off

  device power first by setting pci config space, link-down/not-present/

  power-off are all handled by pciehp the same way "supprise_removal", in

  such case, pciehp_ist() will flag the device as "disconnected" first, then

  unconfig the devcie, unload driver, iommu release device(issing devTLB 
flush)

  delete device. so we checking the device state could work such cases.

But I am still think about if there are other windows.
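
For reference, the way "disconnected" gets set in the surprise removal
path is roughly the following (paraphrased from pciehp_unconfigure_device()
in drivers/pci/hotplug/pciehp_pci.c; the exact code differs across kernel
versions):

  void pciehp_unconfigure_device(struct controller *ctrl, bool presence)
  {
          struct pci_bus *parent = ctrl->pcie->port->subordinate;

          /* Surprise removal: mark every device below the slot as
           * disconnected (error_state = pci_channel_io_perm_failure)
           * before unconfiguring it, so later accesses can be skipped. */
          if (!presence)
                  pci_walk_bus(parent, pci_dev_set_disconnected, NULL);

          /* ... then remove the devices, which unloads their drivers and
           * ends up in the iommu release path seen in the backtrace ... */
  }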


Thanks,

Ethan


>> What happens if IOMMU issues device TLB invalidation after link down but
>> before pci_dev_is_disconnected() returns true?
>
> Seems it wouldn't happen with hotplug cases, safe_removal or
>
> supprise_removal.
>
>
>
> Thanks,
>
> Ethan
>
>>
>> Best regards,
>> baolu
Baolu Lu Dec. 15, 2023, 1:34 a.m. UTC | #9
On 2023/12/15 9:03, Ethan Zhao wrote:
> 
> 2. supprise_removal
> 
>   Users remove the devece directly or bring the device link down/turn off
> 
>   device power first by setting pci config space, link-down/not-present/
> 
>   power-off are all handled by pciehp the same way "supprise_removal", in
> 
>   such case, pciehp_ist() will flag the device as "disconnected" first, 
> then
> 
>   unconfig the devcie, unload driver, iommu release device(issing devTLB 
> flush)
> 
>   delete device. so we checking the device state could work such cases.

If so, then it is fine for the iommu driver. As Robin said, if the
device needs more cleanup, the iommu core should register the right
callback to the driver core and handle it before the device goes away.

Disabling PCI features seems to be a reasonable device cleanup. This
gives us another reason to move ATS enabling/disabling out from the
iommu subsystem. Once this is done, the device driver will enable ATS
during its probe and disable it during its release. There will be no
such workaround in the iommu driver anymore.

Best regards,
baolu
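
A rough illustration of the direction Baolu suggests, with the endpoint
driver rather than the IOMMU subsystem owning the ATS lifecycle
(foo_probe/foo_remove are hypothetical driver callbacks, not existing code):

  #include <linux/pci.h>

  static int foo_probe(struct pci_dev *pdev, const struct pci_device_id *id)
  {
          /* Enable ATS only if the device and its IOMMU support it; the
           * smallest page size is used here purely as an example. */
          if (pci_ats_supported(pdev))
                  pci_enable_ats(pdev, PAGE_SHIFT);
          return 0;
  }

  static void foo_remove(struct pci_dev *pdev)
  {
          /* Disable ATS when the driver is unbound. */
          pci_disable_ats(pdev);
  }
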
Ethan Zhao Dec. 15, 2023, 1:51 a.m. UTC | #10
On 12/15/2023 9:34 AM, Baolu Lu wrote:
> On 2023/12/15 9:03, Ethan Zhao wrote:
>>
>> 2. supprise_removal
>>
>>   Users remove the devece directly or bring the device link down/turn 
>> off
>>
>>   device power first by setting pci config space, link-down/not-present/
>>
>>   power-off are all handled by pciehp the same way 
>> "supprise_removal", in
>>
>>   such case, pciehp_ist() will flag the device as "disconnected" 
>> first, then
>>
>>   unconfig the devcie, unload driver, iommu release device(issing 
>> devTLB flush)
>>
>>   delete device. so we checking the device state could work such cases.
>
> If so, then it is fine for the iommu driver. As Robin said, if the
> device needs more cleanup, the iommu core should register a right
> callback to the driver core and handle it before the device goes away.
>
> Disabling PCI features seems to be a reasonable device cleanup. This
> gives us another reason to move ATS enabling/disabling out from the

For surprise_removal, the device was already removed and powered off when
the iommu device-release notification or cleanup callback is invoked to
disable ATS on the not-present device, etc.

I don't see the point of doing so; perhaps I misunderstand?

Thanks,

Ethan

> iommu subsystem. Once this is done, the device driver will enable ATS
> during its probe and disable it during its release. There will be no
> such workaround in the iommu driver anymore.
>
> Best regards,
> baolu
Lukas Wunner Dec. 21, 2023, 10:42 a.m. UTC | #11
On Wed, Dec 13, 2023 at 11:54:05AM +0000, Robin Murphy wrote:
> I think if we want to ensure ATCs are invalidated on hot-unplug we need an
> additional pre-removal notifier to take care of that, and that step would
> then want to distinguish between an orderly removal where cleaning up is
> somewhat meaningful, and a surprise removal where it definitely isn't.

Even if a user starts the process for orderly removal, the device may be
surprise-removed *during* that process.  So we cannot assume that the
device is actually accessible if orderly removal has been initiated.
If the form factor supports surprise removal, the device may be gone
at any time.

Thanks,

Lukas
Robin Murphy Dec. 21, 2023, 11:01 a.m. UTC | #12
On 2023-12-21 10:42 am, Lukas Wunner wrote:
> On Wed, Dec 13, 2023 at 11:54:05AM +0000, Robin Murphy wrote:
>> I think if we want to ensure ATCs are invalidated on hot-unplug we need an
>> additional pre-removal notifier to take care of that, and that step would
>> then want to distinguish between an orderly removal where cleaning up is
>> somewhat meaningful, and a surprise removal where it definitely isn't.
> 
> Even if a user starts the process for orderly removal, the device may be
> surprise-removed *during* that process.  So we cannot assume that the
> device is actually accessible if orderly removal has been initiated.
> If the form factor supports surprise removal, the device may be gone
> at any time.

Sure, whatever we do there's always going to be some unavoidable 
time-of-check-to-time-of-use race window so we can never guarantee that 
sending a request to the device will succeed. I was just making the 
point that if we *have* already detected a surprise removal, then 
cleaning up its leftover driver model state should still generate a 
BUS_NOTIFY_REMOVE_DEVICE call, but in that case we can know there's no 
point trying to send any requests to the device that's already gone.

Thanks,
Robin.
Lukas Wunner Dec. 21, 2023, 11:07 a.m. UTC | #13
On Thu, Dec 21, 2023 at 11:01:56AM +0000, Robin Murphy wrote:
> On 2023-12-21 10:42 am, Lukas Wunner wrote:
> > On Wed, Dec 13, 2023 at 11:54:05AM +0000, Robin Murphy wrote:
> > > I think if we want to ensure ATCs are invalidated on hot-unplug we need an
> > > additional pre-removal notifier to take care of that, and that step would
> > > then want to distinguish between an orderly removal where cleaning up is
> > > somewhat meaningful, and a surprise removal where it definitely isn't.
> > 
> > Even if a user starts the process for orderly removal, the device may be
> > surprise-removed *during* that process.  So we cannot assume that the
> > device is actually accessible if orderly removal has been initiated.
> > If the form factor supports surprise removal, the device may be gone
> > at any time.
> 
> Sure, whatever we do there's always going to be some unavoidable
> time-of-check-to-time-of-use race window so we can never guarantee that
> sending a request to the device will succeed. I was just making the point
> that if we *have* already detected a surprise removal, then cleaning up its
> leftover driver model state should still generate a BUS_NOTIFY_REMOVE_DEVICE
> call, but in that case we can know there's no point trying to send any
> requests to the device that's already gone.

Right, using pci_dev_is_disconnected() as a *speedup* when we
definitely know the device is gone, that's perfectly fine.

So in that sense the proposed patch is acceptable *after* this
series has been extended to make sure hard lockups can *never*
occur on unplug.

Thanks,

Lukas
Ethan Zhao Dec. 22, 2023, 3:20 a.m. UTC | #14
On 12/21/2023 6:42 PM, Lukas Wunner wrote:
> On Wed, Dec 13, 2023 at 11:54:05AM +0000, Robin Murphy wrote:
>> I think if we want to ensure ATCs are invalidated on hot-unplug we need an
>> additional pre-removal notifier to take care of that, and that step would
>> then want to distinguish between an orderly removal where cleaning up is
>> somewhat meaningful, and a surprise removal where it definitely isn't.
> Even if a user starts the process for orderly removal, the device may be
> surprise-removed *during* that process.  So we cannot assume that the
> device is actually accessible if orderly removal has been initiated.
> If the form factor supports surprise removal, the device may be gone

There is no hardware lock to prevent the user from powering off/removing
a surprise-removal capable device after the device connection state has
been checked but before the ATS invalidation request is issued, so a
request with no target can still be sent.


Thanks,

Ethan

> at any time.
>
> Thanks,
>
> Lukas
>

Patch

diff --git a/drivers/iommu/intel/pasid.c b/drivers/iommu/intel/pasid.c
index 74e8e4c17e81..8557b6dee22f 100644
--- a/drivers/iommu/intel/pasid.c
+++ b/drivers/iommu/intel/pasid.c
@@ -476,6 +476,23 @@  devtlb_invalidation_with_pasid(struct intel_iommu *iommu,
 {
 	struct device_domain_info *info;
 	u16 sid, qdep, pfsid;
+	struct pci_dev *pdev;
+
+	if (!dev_is_pci(dev))
+		return;
+	pdev = to_pci_dev(dev);
+
+	/*
+	 * If the endpoint device's link was brought down by the user's PCI config
+	 * access to its hotplug capable slot's link control register, pciehp_ist()
+	 * will, in response to the DLLSC interrupt, set the device's error_state to
+	 * pci_channel_io_perm_failure. Check the device's state here to avoid
+	 * issuing a meaningless devTLB flush request to it, which might cause a
+	 * lockup warning or deadlock due to the long wait in interrupt context.
+	 */
+
+	if (pci_dev_is_disconnected(pdev))
+		return;
 
 	info = dev_iommu_priv_get(dev);
 	if (!info || !info->ats_enabled)
@@ -495,6 +512,8 @@  devtlb_invalidation_with_pasid(struct intel_iommu *iommu,
 		qi_flush_dev_iotlb(iommu, sid, pfsid, qdep, 0, 64 - VTD_PAGE_SHIFT);
 	else
 		qi_flush_dev_iotlb_pasid(iommu, sid, pfsid, pasid, qdep, 0, 64 - VTD_PAGE_SHIFT);
+
+	return;
 }
 
 void intel_pasid_tear_down_entry(struct intel_iommu *iommu, struct device *dev,