
[RFC,v2] docs: Enhance documentation for iommu bypass

Message ID ZlglhDyJ9eupKC6K@ziqianlu-kbl (mailing list archive)
State New, archived

Commit Message

Aaron Lu May 30, 2024, 7:06 a.m. UTC
When Intel vIOMMU is used and irq remapping is enabled, using bypass_iommu
causes the following two call stacks to be dumped during guest kernel boot
(Linux x86_64), and all PCI devices attached to the root bridge lose their
MSI capability and fall back to using the IOAPIC:

[    0.960262] ------------[ cut here ]------------
[    0.961245] WARNING: CPU: 3 PID: 1 at drivers/pci/msi/msi.h:121 pci_msi_setup_msi_irqs+0x27/0x40
[    0.963070] Modules linked in:
[    0.963695] CPU: 3 PID: 1 Comm: swapper/0 Not tainted 6.9.0-rc7-00056-g45db3ab70092 #1
[    0.965225] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
[    0.967382] RIP: 0010:pci_msi_setup_msi_irqs+0x27/0x40
[    0.968378] Code: 90 90 90 0f 1f 44 00 00 48 8b 87 30 03 00 00 89 f2 48 85 c0 74 14 f6 40 28 01 74 0e 48 81 c7 c0 00 00 00 31 f6 e9 29 42 9e ff <0f> 0b b8 ed ff ff ff c3 cc cc cc cc 66 66 2e 0f 1f 84 00 00 00 00
[    0.971756] RSP: 0000:ffffc90000017988 EFLAGS: 00010246
[    0.972669] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
[    0.973901] RDX: 0000000000000005 RSI: 0000000000000005 RDI: ffff888100ee1000
[    0.975391] RBP: 0000000000000005 R08: ffff888101f44d90 R09: 0000000000000228
[    0.976629] R10: 0000000000000001 R11: 0000000000008d3f R12: ffffc90000017b80
[    0.977864] R13: ffff888102312000 R14: ffff888100ee1000 R15: 0000000000000005
[    0.979092] FS:  0000000000000000(0000) GS:ffff88817bd80000(0000) knlGS:0000000000000000
[    0.980473] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    0.981464] CR2: 0000000000000000 CR3: 000000000302e001 CR4: 0000000000770ef0
[    0.982687] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[    0.983919] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[    0.985143] PKRU: 55555554
[    0.985625] Call Trace:
[    0.986056]  <TASK>
[    0.986440]  ? __warn+0x80/0x130
[    0.987014]  ? pci_msi_setup_msi_irqs+0x27/0x40
[    0.987810]  ? report_bug+0x18d/0x1c0
[    0.988443]  ? handle_bug+0x3a/0x70
[    0.989026]  ? exc_invalid_op+0x13/0x60
[    0.989672]  ? asm_exc_invalid_op+0x16/0x20
[    0.990374]  ? pci_msi_setup_msi_irqs+0x27/0x40
[    0.991118]  __pci_enable_msix_range+0x325/0x5b0
[    0.991883]  pci_alloc_irq_vectors_affinity+0xa9/0x110
[    0.992698]  vp_find_vqs_msix+0x1a8/0x4c0
[    0.993332]  vp_find_vqs+0x3a/0x1a0
[    0.993893]  vp_modern_find_vqs+0x17/0x70
[    0.994531]  init_vq+0x3ad/0x410
[    0.995051]  ? __pfx_default_calc_sets+0x10/0x10
[    0.995789]  virtblk_probe+0xeb/0xbc0
[    0.996362]  ? up_write+0x74/0x160
[    0.996900]  ? down_write+0x4d/0x80
[    0.997450]  virtio_dev_probe+0x1bc/0x270
[    0.998059]  really_probe+0xc1/0x390
[    0.998626]  ? __pfx___driver_attach+0x10/0x10
[    0.999288]  __driver_probe_device+0x78/0x150
[    0.999924]  driver_probe_device+0x1f/0x90
[    1.000506]  __driver_attach+0xce/0x1c0
[    1.001073]  bus_for_each_dev+0x70/0xc0
[    1.001638]  bus_add_driver+0x112/0x210
[    1.002191]  driver_register+0x55/0x100
[    1.002760]  virtio_blk_init+0x4c/0x90
[    1.003332]  ? __pfx_virtio_blk_init+0x10/0x10
[    1.003974]  do_one_initcall+0x41/0x240
[    1.004510]  ? kernel_init_freeable+0x240/0x4a0
[    1.005142]  kernel_init_freeable+0x321/0x4a0
[    1.005749]  ? __pfx_kernel_init+0x10/0x10
[    1.006311]  kernel_init+0x16/0x1c0
[    1.006798]  ret_from_fork+0x2d/0x50
[    1.007303]  ? __pfx_kernel_init+0x10/0x10
[    1.007883]  ret_from_fork_asm+0x1a/0x30
[    1.008431]  </TASK>
[    1.008748] ---[ end trace 0000000000000000 ]---

The second call stack is dumped at pci_msi_teardown_msi_irqs().

Actually, every PCI device hits these two paths; only two call stacks are
dumped because both places use WARN_ON_ONCE().

What happens is: when irq remapping is enabled, the kernel expects every PCI
device (or its parent bridge) to appear in some DMA Remapping Hardware unit
Definition (DRHD)'s device scope list. If a device does not, its irq domain
becomes NULL, which makes enabling MSI for that device fail.
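
Inside the guest, the fallback can be confirmed with something like the
following (illustrative commands, not taken from the original report; the
00:02.0 address is a placeholder for an affected device):

  # MSI capability still listed but not enabled ("MSI: Enable-")
  lspci -vv -s 00:02.0 | grep -i msi
  # interrupts routed through the IO-APIC instead of PCI-MSI
  grep -E 'IO-APIC|PCI-MSI' /proc/interrupts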

Per my understanding, only a virtualized system can have such a setup: irq
remapping enabled while not all PCI/PCIe devices appear in a DRHD's device
scope.
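
For illustration, a sketch of a QEMU command line that creates such a setup
(option names follow docs/bypass-iommu.txt; kernel-irqchip=split is what
intremap=on needs under KVM, and the rest of the command line is omitted):

  qemu-system-x86_64 -machine q35,kernel-irqchip=split,default_bus_bypass_iommu=true \
      -device intel-iommu,intremap=on \
      ...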

Enhance the document by mentioning what can happen when bypass_iommu is
used with a Linux guest.

For detailed qemu cmdline and guest kernel dmesg, please see:
https://lore.kernel.org/qemu-devel/20240510072519.GA39314@ziqianlu-desk2/

Reported-by: Juro Bystricky <juro.bystricky@intel.com>
Signed-off-by: Aaron Lu <aaron.lu@intel.com>
---
v2: make it clear this behaviour is Linux guest specific.

 docs/bypass-iommu.txt | 6 ++++++
 1 file changed, 6 insertions(+)

Patch

diff --git a/docs/bypass-iommu.txt b/docs/bypass-iommu.txt
index e6677bddd3..fa80a5ce1f 100644
--- a/docs/bypass-iommu.txt
+++ b/docs/bypass-iommu.txt
@@ -68,6 +68,12 @@  devices might send malicious dma request to virtual machine if there is no
 iommu isolation. So it would be necessary to only bypass iommu for trusted
 device.
 
+When Intel IOMMU is virtualized, if irq remapping is enabled, PCI and PCIe
+devices that bypass the vIOMMU will have their MSI/MSI-X functionality
+disabled and fall back to the IOAPIC in a Linux x86_64 guest. If this is
+not desired, disable irq remapping with:
+qemu -device intel-iommu,intremap=off
+
 Implementation
 ==============
 The bypass iommu feature includes: