Unable to pass SATA controller to VM with intel_iommu=igfx_off

Message ID 20180108151932.23bb70ea@t450s.home (mailing list archive)
State New, archived

Commit Message

Alex Williamson Jan. 8, 2018, 10:19 p.m. UTC
On Mon, 8 Jan 2018 08:26:46 +0100
Binarus <lists@binarus.de> wrote:

> On 07.01.2018 07:22, Alex Williamson wrote:
> 
> > For the system you're using there is no integrated graphics on the
> > processor, so "igfx_off" does nothing.  However, using that as the only
> > parameter to intel_iommu= is basically the same as not providing an
> > intel_iommu= option at all, ie. the IOMMU is not enabled, thus...  
> 
> First of all, thank you very much for spending time on that problem!
> 
> I see. This is the opposite of the logic I would have expected :-),
> and it explains the behavior (I would have expected the option to be
> turned on whenever it is encountered, and inappropriate parameters to
> be ignored rather than turning the option off).
> 
> Additionally, I have misunderstood the following passage
> 
> "If you encounter issues with graphics devices, you can try adding
> option intel_iommu=igfx_off to turn off the integrated graphics engine.
> If this fixes anything, please ensure you file a bug reporting the problem."
> 
> from the following document:
> 
> https://www.kernel.org/doc/Documentation/Intel-IOMMU.txt
> 
> The wording "If this fixes anything" made me somehow think that igfx_off
> in rare cases might do something mysterious even to devices which are
> not integrated graphics devices.
> 
> Thank you very much for clearing that up and for explaining how
> intel_iommu works.
> 
> >> This message is to report that I can't pass through the SATA controller in question to a VM when the kernel has been booted with intel_iommu=igfx_off. Things already go wrong during the boot process. dmesg | grep vfio says:
> >>
> >> [    1.074239] vfio-pci: probe of 0000:02:00.0 failed with error -22
> >> [    1.074309] vfio_pci: add [1b4b:9128[ffff:ffff]] class 0x000000/00000000  
> > 
> > vfio-pci won't probe the device because it doesn't have an IOMMU.  
> 
> This is Murphy's law. I thought for a long time about buying a
> controller with a different chipset and obviously made the wrong
> choice. I should have read the controller's datasheet before buying,
> but understanding it would surely have cost me a huge amount of time.
> The controller was about 30 EUR (about 35 US$), so spending two days on
> understanding that datasheet wouldn't have made a lot of sense ...

I'm not even sure Marvell has figured out that they have this problem,
let alone documented it in a datasheet ;)

> > Marvell SATA controllers are notoriously bad with IOMMUs, they perform
> > DMA using a requester ID other than the actual PCI address of the
> > device.  For instance, the SATA controller is often at PCI function
> > address 0, but function address 1 is used for the requester ID.  The
> > IOMMU uses the requester ID to index the DMA translations for the
> > device, so getting this fundamental aspect of the transaction wrong in
> > hardware is often fatal for doing any DMA at all when the IOMMU is
> > enabled.  This should cause all sorts of "DMAR" faults during boot when
> > VT-d is enabled.  
> 
> Thanks for the in-depth explanation of the problem. I had never read
> the Marvell datasheets or the IOMMU specs until now. I wish I had some
> more time; that whole thing is very interesting.
> 
> > We do have quirks to handle various Marvell chips, but not this one.
> > If you're interested in building a kernel, I can provide a patch to
> > test, though I suspect the reason we don't already have a patch for
> > this device might be conflicting results.  Your other option is to pick
> > a SATA controller that isn't so broken.  Thanks,  
> 
> Initially, I didn't want to build a new kernel on that machine (I
> don't have a second one at hand with identical hardware).
> 
> But I am quite impressed by the KVM project in general, and by the fact
> that you can already offer a patch for testing in particular, so I'll
> do it. I would be glad if I could be of any help to the KVM project.
> Could you please provide me with instructions? Can your patch be
> applied to 4.8 or 4.9, or only to the newest kernel?

We already have quirks to support various other versions of the Marvell
chip, but the 9128 is missing, so it's just a couple lines to add it.
This is against v4.9.75:
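
(For reference, the quirk_dma_func1_alias() helper that these two new
lines hook into already exists nearby in quirks.c; in v4.9 it looks
roughly like the following - paraphrased, so check the file rather than
trusting my memory:

 static void quirk_dma_func1_alias(struct pci_dev *dev)
 {
         if (PCI_FUNC(dev->devfn) != 1)
                 pci_add_dma_alias(dev,
                                   PCI_DEVFN(PCI_SLOT(dev->devfn), 1));
 }

i.e. it registers function 1 of the same slot as a DMA alias of the
device, so the IOMMU also indexes translations for the bogus requester
ID.)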

 
> One additional question (to get my problem solved in any case): Could
> you recommend a SATA controller (chip) which works flawlessly in that
> respect? I just need at least one SATA channel (but better two) passed
> through into that VM - no raid, mirroring, striping or other fancy
> things needed.

Personally, I don't often assign storage controllers and they're mostly
all terrible.  The Marvell controllers nearly all have this DMA
aliasing issue afaik, but you can see in the code nearby the patch
above that we already have quirks for many of them.  For instance you
could buy a similarly priced Marvell 9230 card rather than a 9128 and
it might have worked since we've already got a quirk for it.  Sorry I
can't be more precise, even as the device assignment maintainer I
generally use virtio for VM disks and find it to be sufficiently fast
and feature-ful.  Thanks,

Alex

Comments

Binarus Jan. 9, 2018, 5:58 p.m. UTC | #1
First of all, thank you very much!

On 08.01.2018 23:19, Alex Williamson wrote:

> We already have quirks to support various other versions of the Marvell
> chip, but the 9128 is missing, so it's just a couple lines to add it.
> This is against v4.9.75:
> 
> diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
> index 98eba9127a0b..19ca3c9fac3a 100644
> --- a/drivers/pci/quirks.c
> +++ b/drivers/pci/quirks.c
> @@ -3868,6 +3868,8 @@ static void quirk_dma_func1_alias(struct pci_dev *dev)
>  /* https://bugzilla.kernel.org/show_bug.cgi?id=42679#c49 */
>  DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_MARVELL_EXT, 0x9230,
>  			 quirk_dma_func1_alias);
> +DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_MARVELL_EXT, 0x9128,
> +			 quirk_dma_func1_alias);
>  DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_TTI, 0x0642,
>  			 quirk_dma_func1_alias);
>  /* https://bugs.gentoo.org/show_bug.cgi?id=497630 */

There is good news, and there is bad news:

The good news is that the patch works as expected. I applied it to
kernel 4.9 and recompiled (which was not that easy for me because this
machine boots from ZFS - beware of forgetting to rebuild the ZFS modules
and to include them in the new kernel / initramfs as well ...).

I then booted the new kernel with intel_iommu=on. The boot process went
normally - the AHCI / SATA driver now behaves correctly when
initializing the controller in question.

I then configured my system to let the vfio_pci kernel driver grab that
controller during the boot process, and made sure that vfio_pci gets
loaded before the AHCI kernel driver.

That also worked well; dmesg | grep vfio showed the expected output,
and lspci showed that the controller was indeed under the control of
vfio_pci.
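
(In case someone wants to reproduce this: one way to arrange it is via
modprobe.d - a sketch; the file name is arbitrary and the details may
differ on other distros:

 # /etc/modprobe.d/vfio.conf
 options vfio-pci ids=1b4b:9128
 softdep ahci pre: vfio-pci

followed by an initramfs update so it takes effect at boot.)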

But I couldn't get any further, and this is the bad news:

I spent the rest of my day trying to actually pass the controller
through to the VM in question. I start this VM from the command line:

qemu_xxx <option>

My first step was to change the machine model from pc (the default) to
q35 because I thought it would be a good idea to use the default pcie.0
bus that model provides.

Since https://github.com/qemu/qemu/blob/master/docs/pcie.txt says that
we shouldn't connect PCIe devices directly to the pcie.0 bus, I then
added

-device ioh3420,bus=pcie.0,addr=1c.0,multifunction=on,port=2,chassis=1,id=root.1

to the command line and booted the VM. This went normally, but of course
the OS in the VM did not find the controller because the above line only
adds a new PCIe root port, but does not pass through the controller.
However, I consider it noteworthy that this worked.

As the final step, I added

-device vfio-pci,host=02:00.0,bus=root.1,addr=00.0

to the command line.

For the rest of the day, I tested every combination of vfio-pci and
ioh3420 options that I found in various tutorials / threads or that came
to my mind. With each of these combinations, the VM shows the same
behavior:

The SeaBIOS boot screen hangs for about a minute or so. Then the OS
(W2K8 R2 server, 64 bit) hangs forever at the first screen that shows
the progress bar. By booting into safe mode, I found out that this
happens when it tries to load the classpnp.sys driver.

In some cases, when starting the VM, there was a message on the console
saying it was disabling IRQ 16.

This is the point where I am lost (again).

I think I have got something very basic badly wrong; my interpretation
is that the OS does not find the bus topology it expects. What scares me
is that even SeaBIOS already hangs, although most of the articles out
there propose (more or less) exactly what I am doing.

Could my (Debian stretch's) qemu be too old (it is 2.8.0)?

Or does qemu / vfio_pci have the same requester ID problem as the kernel?

What else could be the reason?

An example of a command line I have used:

/usr/bin/qemu-system-x86_64
-machine q35,accel=kvm
-cpu host
-smp cores=2,threads=2,sockets=1
-rtc base=localtime,clock=host,driftfix=none
-drive file=/vm-image/dax.img,format=raw,if=virtio,cache=writeback,index=0
-drive file=/dev/sda,format=raw,if=virtio,cache=none,index=1
-device ioh3420,bus=pcie.0,addr=1c.0,multifunction=on,port=2,chassis=1,id=root.1
-device vfio-pci,host=02:00.0,bus=root.1,addr=00.0
-boot c
-pidfile /root/qemu-kvm/qemu-dax.pid
-m 12288
-k de
-daemonize
-usb -usbdevice "tablet"
-name dax
-device virtio-net-pci,vlan=0,mac=02:01:01:01:02:01
-net tap,vlan=0,name=dax,ifname=dax0,script=/etc/qemu-ifup,downscript=/etc/qemu-ifdown
-vnc :2

> Personally, I don't often assign storage controllers and they're mostly
> all terrible.  The Marvell controllers nearly all have this DMA
> aliasing issue afaik, but you can see in the code nearby the patch
> above that we already have quirks for many of them.  For instance you
> could buy a similarly priced Marvell 9230 card rather than a 9128 and
> it might have worked since we've already got a quirk for it.  Sorry I
> can't be more precise, even as the device assignment maintainer I
> generally use virtio for VM disks and find it to be sufficiently fast
> and feature-ful.  Thanks,

Thank you very much - no problem. Just a short explanation: My issue is
not performance; instead, I need to be able to dynamically mount and
unmount ("eject") disks from within the VM (via the famous Windows tray
icon "Safely remove hardware").

Some days ago, Paolo Bonzini explained to me on this list how I could
achieve clean removal of HDDs from a VM, using either SCSI hotplug or
PCIe hotplug. Both suggestions worked at first sight.

However, I am not sure whether W2K8 R2 reliably handles SCSI / PCIe
hotplug every time, and both methods require commands in the VM and in
the host system.

For my use case (changing a disk twice a day without restarting the VM)
this is too complicated and error-prone; I really would like a solution
where I only need to eject the disk from within the Windows VM. If I
could finally pass that (or another) SATA controller through into the
VM, this problem would be solved in the most elegant way.

Thank you very much again for any help,

Binarus
Binarus Jan. 9, 2018, 9:36 p.m. UTC | #2
To answer my own message:

On 09.01.2018 18:58, Binarus wrote:

> The Seabios boot screen hangs for about a minute or so. Then the OS
> (W2K8 R2 server 64 bit) hangs forever at the first screen which shows
> the progress bar. By booting into safe mode, I have found out that this
> happens when it tries to load the classpnp.sys driver.
> 
> In some cases, when starting the VM, there was a message on the console
> saying it was disabling IRQ 16.
> 
> This is the point where I am lost (again).

It seems I have got it to work. I have added the option
"x-no-kvm-intx=on" to the device definition. My command line is now:

/usr/bin/qemu-system-x86_64
 -machine q35,accel=kvm
 -cpu host
 -smp cores=2,threads=2,sockets=1
 -rtc base=localtime,clock=host,driftfix=none
 -drive file=/vm-image/dax.img,format=raw,if=virtio,cache=writeback,index=0
 -device ioh3420,bus=pcie.0,addr=1c.0,multifunction=on,port=2,chassis=1,id=root.1
 -device vfio-pci,host=02:00.0,bus=root.1,addr=00.0,x-no-kvm-intx=on
 -boot c
 -pidfile /root/qemu-kvm/qemu-dax.pid
 -m 12288
 -k de
 -daemonize
 -usb -usbdevice "tablet"
 -name dax
 -device virtio-net-pci,vlan=0,mac=02:01:01:01:02:01
 -net tap,vlan=0,name=dax,ifname=dax0,script=/etc/qemu-ifup,downscript=/etc/qemu-ifdown
 -vnc :2

This command line makes SeaBIOS hang for between 30 and 60 seconds (the
time it takes does not seem to be constant) during the boot process, but
then the W2K8 R2 server boots up without any issue. Within the VM, I
have installed the Marvell Windows drivers for the controller's chipset.
Great!

And as desired, I can now cleanly "eject" the disks connected to that
controller without leaving the VM, i.e. without visiting the host's console.

Remaining questions:

- What could make SeaBIOS hang for such a long time upon every boot?

- Could you please briefly explain what the option "x-no-kvm-intx=on"
does and why I need it in this case?

- Could you please briefly explain what exactly it wants to tell me
when it says that it disables INT xx, and notably whether this is a bad
thing I should take care of?

- What about the "x-no-kvm-msi" and "x-no-kvm-msix" options? Would it
be better to use them as well? I couldn't find any sound information
about what exactly they do. (Note: Initially, I had all three of those
"x-no..." options active, which made the VM boot for the first time;
later, out of curiosity, I found out that "x-no-kvm-intx" is the
essential one. Without it, the VM won't boot; the other two don't seem
to change anything in my case.)

- Can we expect your patch to go upstream (perhaps after the above
issues / questions have been investigated)? I will try to convince the
Debian people to include the patch in 4.9; if they refuse, I will have
to compile a new kernel each time they release one, which has been
happening quite often (probably security fixes) for some time now ...

Thank you very much again,

Binarus
Alex Williamson Jan. 9, 2018, 10:41 p.m. UTC | #3
On Tue, 9 Jan 2018 22:36:01 +0100
Binarus <lists@binarus.de> wrote:

> To answer my own message:
> 
> On 09.01.2018 18:58, Binarus wrote:
> 
> > The Seabios boot screen hangs for about a minute or so. Then the OS
> > (W2K8 R2 server 64 bit) hangs forever at the first screen which shows
> > the progress bar. By booting into safe mode, I have found out that this
> > happens when it tries to load the classpnp.sys driver.
> > 
> > In some cases, when starting the VM, there was a message on the console
> > saying it was disabling IRQ 16.
> > 
> > This is the point where I am lost (again).  
> 
> It seems I have got it to work. I have added the option
> "x-no-kvm-intx=on" to the device definition. My command line is now:
> 
> /usr/bin/qemu-system-x86_64
>  -machine q35,accel=kvm
>  -cpu host
>  -smp cores=2,threads=2,sockets=1
>  -rtc base=localtime,clock=host,driftfix=none
>  -drive file=/vm-image/dax.img,format=raw,if=virtio,cache=writeback,index=0
>  -device
>   ioh3420,bus=pcie.0,addr=1c.0,multifunction=on,port=2,chassis=1,id=root.1
>  -device vfio-pci,host=02:00.0,bus=root.1,addr=00.0,x-no-kvm-intx=on
>  -boot c
>  -pidfile /root/qemu-kvm/qemu-dax.pid
>  -m 12288
>  -k de
>  -daemonize
>  -usb -usbdevice "tablet"
>  -name dax
>  -device virtio-net-pci,vlan=0,mac=02:01:01:01:02:01
>  -net
>  tap,vlan=0,name=dax,ifname=dax0,script=/etc/qemu-ifup,downscript=/etc/qemu-ifdown
>  -vnc :2
> 
> This command line makes the Seabios hang for between 30 and 60 seconds
> (it seems the time it takes is not always the same) during the boot
> process, but then boots up the W2K8 R2 server without any issue. Within
> the VM, I have installed the Marvell Windows drivers for the
> controller's chipset. Great!
> 
> And as desired, I can now cleanly "eject" the disks connected to that
> controller without leaving the VM, i.e. without visiting the host's console.
> 
> Remaining questions:
> 
> - What could make the Seabios hang for such a long time upon every boot?

Perhaps some sort of problem with the device ROM.  Assuming you're not
booting the VM from the assigned device, you can add rombar=0 to the
qemu vfio-pci device options to disable the ROM.  I suppose it's
possible that SeaBIOS might know how to talk to the device regardless
of the ROM, so no guarantees that will resolve it.  Setting a bootindex
both on the vfio-pci device and the actual boot device could help.  I
think the '-boot c' option is deprecated; explicitly specifying an
emulated controller would be better.  virt-install or virt-manager
would do this for you.  Also, using q35 vs 440fx for the VM machine
type makes no difference; q35 is, if anything, more troublesome imo.
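
For example, as a sketch based on your existing device line:

 -device vfio-pci,host=02:00.0,bus=root.1,addr=00.0,rombar=0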

> - Could you please shortly explain what the option "x-no-kvm-intx=on"
> does and why I need it in this case?

INTx is the legacy PCI interrupt (ie. INTA, INTB, INTC, INTD).  This is
a level triggered interrupt therefore it continues to assert until the
device is serviced.  It must therefore be masked on the host while it
is handled by the guest.  There are two paths we can use for injecting
this interrupt into the VM and unmasking it on the host once the VM
samples the interrupt.  When KVM is used for acceleration, these happen
via direct connection between the vfio-pci and kvm modules using
eventfds and irqfds.  The x-no-kvm-intx option disables that path,
instead bouncing out to QEMU to do the same.

TBH, I have no idea why this would make it work.  The QEMU path is
slower than the KVM path, but they should be functionally identical.

> - Could you please shortly explain what exactly it wants to tell me when
> it says that it disables INT xx, and notable if this is a bad thing I
> should take care of?

The "Disabling IRQ XX, nobody cared" message means that the specified
IRQ asserted many times without any of the interrupt handlers claiming
that it was their device asserting it.  It then masks the interrupt at
the APIC.  With device assignment this can mean that the mechanism we
use to mask the device doesn't work for that device.  There's a
vfio-pci module option you can use to have vfio-pci mask the interrupt
at the APIC rather than the device, nointxmask=1.  The trouble with
this option is that it can only be used with exclusive interrupts, so
if any other devices share the interrupt, starting the VM will fail.
As a test, you can unbind conflicting devices from their drivers
(assuming non-critical devices).
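
As a sketch (the file name is just an example):

 # /etc/modprobe.d/vfio.conf
 options vfio-pci nointxmask=1

plus an initramfs update if vfio-pci is loaded from the initramfs.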

The troublesome point here is that regardless of x-no-kvm-intx, the
kernel uses the same masking technique for the device, so it's unclear
why one works and the other does not.
 
> - What about the "x-no-kvm-msi" and "x-no-kvm-msix" options? Would it be
> better to use them as well? I couldn't find any sound information about
> what exactly they do (Note: Initially, I had all three of those
> "x-no..." options active, which made the VM boot the first time, and
> later out of curiosity found out that "x-no-kvm-intx" is the essential
> one. Without this one, the VM won't boot; the other two don't seem to
> change anything in my case).

Similar to the INTx version, they route the interrupts out through QEMU
rather than inject them through a side channel with KVM.  They're just
slower.  Generally these options are only used for debugging as they
make the interrupts visible to QEMU, functionality is generally not
affected.

What interrupt mode does the device operate in once the VM is running?
You can run 'lspci -vs <device address>' on the host and see something
like:

	Capabilities: [50] MSI: Enable- Count=1/1 Maskable+ 64bit+
	Capabilities: [70] MSI-X: Enable+ Count=10 Masked-

In this case the Enable+ shows the device is using MSI-X rather than
MSI, which shows Enable-.  The device might not support both (or
either).  If none are Enable+, legacy interrupts are probably being
used.

Often legacy interrupts are only used at boot and then the device
switches to MSI/X.  If that's the case for this device, x-no-kvm-intx
doesn't really hurt you at runtime.

> - Could we expect your patch to go into upstream (perhaps after the
> above issues / questions have been investigated)? I will try to convince
> the Debian people to include the patch into 4.9; if they refuse, I will
> have to compile a new kernel each time they release one, which happens
> quite often (probably security fixes) since some time ...

I would not recommend trying to convince Debian to take a non-upstream
patch.  The process is that I need to do more research to figure out
why this device isn't already quirked; I'm sure others have complained,
but did similar patches make things worse for them, or did they simply
disappear?  Can you confirm whether the device behaves properly for
host use with the patch?  Issues with assigning the device could be
considered secondary if the host behavior is obviously improved.
Alternatively, the 9230, or various others in that section of the
quirk code, are already quirked, so you can decide if picking a
different $30 card is a better option for you ;) Thanks,

Alex
Binarus Jan. 10, 2018, 10:14 a.m. UTC | #4
Thank you very much for the detailed and invaluable information!

In the meantime, it has turned out that host and VM are stable, but that
performance is a disaster. Therefore, the success is a Pyrrhic victory.
I have connected two disks to the controller and copied a large file
between them from within the VM. The speed was about 3 MB/s. Of course,
this does not make any sense.

In any case, I will follow your advice and buy another adapter card,
probably one with the ASM1061. But it would still be interesting
(hopefully) to figure out what is going on here. Thus, ...

On 09.01.2018 23:41, Alex Williamson wrote:
>> Remaining questions:
>>
>> - What could make the Seabios hang for such a long time upon every boot?
> 
> Perhaps some sort of problem with the device ROM.  Assuming you're not
> booting the VM from the assigned device, you can add rombar=0 to the
> qemu vfio-pci device options to disable the ROM.

I have now tried that. Sadly, rombar=0 did not change anything. SeaBIOS
still hangs during boot for a minute or so; then the VM boots up without
problems. SeaBIOS hangs whether or not disks are connected to the
controller.

> Setting a bootindex
> both on the vfio-pci device and the actual boot device could help.

Unfortunately, setting the bootindex on the actual boot device is not
possible since the boot device's image format is raw. Trying to set a
bootindex makes qemu emit the following error message upon start:

"[...] Block format 'raw' does not support the option 'bootindex'"

I then set the bootindex of the vfio device to 9; that did not change
anything. Additionally, I tried -boot strict=on; that didn't change
anything either.
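
(My guess is that bootindex has to be attached to a device rather than
to the drive; an untested sketch which might accept it for the raw image
would be to split the -drive into a backend plus an explicit device:

 -drive file=/vm-image/dax.img,format=raw,if=none,id=disk0,cache=writeback
 -device virtio-blk-pci,drive=disk0,bootindex=1

I haven't tried this yet, though.)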

I think I can remember a message from you on another list (or maybe the
same) where you were helping a person with a similar problem. If memory
serves, you suggested that the SeaBIOS might be too old. Could that be
the case for me, too?

> I
> think the '-boot c' option is deprecated, explicitly specifying a
> emulated controller would be better.

I have re-read qemu's manual for my host system, and of course, you are
right :-) I'll try to figure out how to set the boot order in a
non-deprecated fashion (but still without using bootindex).

>  Also, using q35 vs 440fx for the VM machine
> type makes no difference, q35 is, if anything, more troublesome imo.

This is interesting. I have re-tested and confirmed my initial findings:
When I use -machine pc,... instead of -machine q35,..., qemu emits the
following error when starting:

-device ioh3420,bus=pcie.0,addr=1c.0,multifunction=on,port=2,chassis=1,id=root.1: Bus 'pcie.0' not found

This is one of the few things I thought I had understood. According to
my research, the q35 model establishes a root PCI Express bus by default
(pcie.0), while the pc (= 440fx) model establishes only a root PCI bus
by default (pci.0).

The device which I would like to pass through is a PCI-E device.
According to https://github.com/qemu/qemu/blob/master/docs/pcie.txt (as
far as I have understood it), we should put PCI-E devices only on PCI-E
buses, but not on PCI buses.

If I used -machine pc, there would be only a PCI (non-Express) root
bus, and although we could plug the pass-through device in there, we
shouldn't do it (or should we?). Did I get this wrong?

>> - Could you please shortly explain what the option "x-no-kvm-intx=on"
>> does and why I need it in this case?
> 
> INTx is the legacy PCI interrupt (ie. INTA, INTB, INTC, INTD).  This is
> a level triggered interrupt therefore it continues to assert until the
> device is serviced.  It must therefore be masked on the host while it
> is handled by the guest.  There are two paths we can use for injecting
> this interrupt into the VM and unmasking it on the host once the VM
> samples the interrupt.  When KVM is used for acceleration, these happen
> via direct connection between the vfio-pci and kvm modules using
> eventfds and irqfds.  The x-no-kvm-intx option disables that path,
> instead bouncing out to QEMU to do the same.

I see. Thank you very much for explaining so clearly.

> TBH, I have no idea why this would make it work.  The QEMU path is
> slower than the KVM path, but they should be functionally identical.

Perhaps the device design is indeed so badly broken that the two paths
are not functionally identical in this case. I suppose the difference in
speed between the two paths is not great enough to explain the extremely
slow data transfer in the VM?

>> - Could you please shortly explain what exactly it wants to tell me when
>> it says that it disables INT xx, and notable if this is a bad thing I
>> should take care of?
> 
> The "Disabling IRQ XX, nobody cared" message means that the specified
> IRQ asserted many times without any of the interrupt handlers claiming
> that it was their device asserting it.  It then masks the interrupt at
> the APIC.  With device assignment this can mean that the mechanism we
> use to mask the device doesn't work for that device.  There's a
> vfio-pci module option you can use to have vfio-pci mask the interrupt
> at the APIC rather than the device, nointxmask=1.  The trouble with
> this option is that it can only be used with exclusive interrupts, so
> if any other devices share the interrupt, starting the VM will fail.
> As a test, you can unbind conflicting devices from their drivers
> (assuming non-critical devices).

Again, thank you very much for the clear explanation. I'll investigate
and report back in a few hours.

> The troublesome point here is that regardless of x-no-kvm-intx, the
> kernel uses the same masking technique for the device, so it's unclear
> why one works and the other does not.

>> - What about the "x-no-kvm-msi" and "x-no-kvm-msix" options? Would it be
>> better to use them as well? I couldn't find any sound information about
>> what exactly they do (Note: Initially, I had all three of those
>> "x-no..." options active, which made the VM boot the first time, and
>> later out of curiosity found out that "x-no-kvm-intx" is the essential
>> one. Without this one, the VM won't boot; the other two don't seem to
>> change anything in my case).
> 
> Similar to the INTx version, they route the interrupts out through QEMU
> rather than inject them through a side channel with KVM.  They're just
> slower.  Generally these options are only used for debugging as they
> make the interrupts visible to QEMU, functionality is generally not
> affected.

Thank you very much - got it.

> What interrupt mode does the device operate in once the VM is running?
> You can run 'lspci -vs <device address>' on the host and see something
> like:
> 
> 	Capabilities: [50] MSI: Enable- Count=1/1 Maskable+ 64bit+
> 	Capabilities: [70] MSI-X: Enable+ Count=10 Masked-
> 
> In this case the Enable+ shows the device is using MSI-X rather than
> MSI, which shows Enable-.  The device might not support both (or
> either).  If none are Enable+, legacy interrupts are probably being
> used.

It says:

...
Capabilities: [50] MSI: Enable- Count=1/1 Maskable- 64bit-
...
Capabilities: [70] Express (v2) Legacy Endpoint, MSI 00
...

Nothing else containing the string "MSI" is in the output.

> Often legacy interrupts are only used at boot and then the device
> switches to MSI/X.  If that's the case for this device, x-no-kvm-intx
> doesn't really hurt you runtime.

>> - Could we expect your patch to go into upstream (perhaps after the
>> above issues / questions have been investigated)? I will try to convince
>> the Debian people to include the patch into 4.9; if they refuse, I will
>> have to compile a new kernel each time they release one, which happens
>> quite often (probably security fixes) since some time ...
> 
> I would not recommend trying to convince Debian to take a non-upstream
> patch, the process is that I need to do more research to figure out
> why this device isn't already quirked, I'm sure others have complained,
> but did similar patches make things worse for them or did they simply
> disappear. Can you confirm whether the device behaves properly for
> host use with the patch?  Issues with assigning the device could be
> considered secondary if the host behavior is obviously improved.

I can definitely confirm that the patch vastly improves behavior for
the host. As described in my first message, without the patch and with
intel_iommu=on, the boot process hung for a minute or so when the kernel
tried to initialize that controller, obviously hitting timeouts and
spitting out error messages multiple times. The two most relevant
messages were that the SATA link speed was being reduced (one time to
3.0 Gbps, the next time to 1.5 Gbps, repeating several times) and that
the disks could not be identified (if any were connected). This applied
to both channels, and consequently the respective block devices were
missing after the boot process had finished.

I have verified this behavior multiple times with the controller card
connected to different slots, with and without HDDs connected, and after
cold boots as well as after warm boots.

There were no issues when the kernel parameter intel_iommu had *not*
been given.

With your patch applied, the system boots up without any problem
whether or not intel_iommu=on is given. I have verified this multiple
times, putting the controller in different slots. In every case, the
boot process went normally, and the disks connected to the controller
had become block devices as expected once the system had finished
booting.

Likewise, I have tested the behavior with the patched kernel, but
without the intel_iommu parameter. I did not notice any problems.

All tests were done with the Debian 4.9.0 kernel with Debian patches
(version 65). When patching the kernel, I have downloaded the Debian
kernel source package, unpacked it, copied the config from the stock
kernel, applied your patch and then compiled.

During yesterday's research, I had the system running without passing
through that controller most of the time (because pass-through didn't
work yet); instead, I passed the two disks (i.e. the block devices)
connected to the controller into the VM in question via virtio. I did
not notice any problems or misbehavior. This is a production (VM)
server, so I surely would have noticed if there had been problems :-)

In summary, despite the short testing time, we can conclude:

1) Your patch only affects people with a Marvell 9128 SATA chipset.

2) People without intel_iommu=on do not benefit from your patch, and are
not hurt by it.

3) People with intel_iommu=on and a stock kernel will not be able to
boot cleanly if that SATA chip is in the system; the disks connected to
that chip probably won't be recognized (as in my case), and if they are
recognized nevertheless, it would probably be dangerous to use them.

4) People with the patched kernel will be able to use that controller
without any problem, whether intel_iommu=on is given or not; at least, I
can definitely confirm that the boot problems are solved by that patch.

Long-term stability should be tested further. Although I am personally
convinced and will use the controller in production (either for
pass-through, if I can make it work in terms of performance, or in
another machine for the host system), I take no responsibility. I am
just reporting my personal experience.

> Alternatively, the 9230, or various others in that section of the
> quirk code, are already quirked, so you can decide if picking a
> different $30 card is a better option for you ;) Thanks,

Perhaps I'll even buy two different ones: one with the 9230 (though I
seriously wonder why its design should be less flawed than that of the
9128), and one with the ASM1061 (hoping there is at least one company
which did it right - getting Windows drivers for that one could be a
nightmare, though).

Thank you very much again,

Binarus
Binarus Jan. 10, 2018, 4:40 p.m. UTC | #5
Alex, thank you! I think I have solved the performance problem and have
made some interesting observations.

On 09.01.2018 23:41, Alex Williamson wrote:
>> - Could you please shortly explain what exactly it wants to tell me when
>> it says that it disables INT xx, and notable if this is a bad thing I
>> should take care of?
> 
> The "Disabling IRQ XX, nobody cared" message means that the specified
> IRQ asserted many times without any of the interrupt handlers claiming
> that it was their device asserting it.  It then masks the interrupt at
> the APIC.  With device assignment this can mean that the mechanism we
> use to mask the device doesn't work for that device.  There's a
> vfio-pci module option you can use to have vfio-pci mask the interrupt
> at the APIC rather than the device, nointxmask=1.  The trouble with
> this option is that it can only be used with exclusive interrupts, so
> if any other devices share the interrupt, starting the VM will fail.
> As a test, you can unbind conflicting devices from their drivers
> (assuming non-critical devices).

This statement has put me on the right track:

First, I rebooted the machine without vfio_pci and looked into
/proc/interrupts. The SATA controller in question was bound to INT 37
and was the *only* device using that INT.

I then rebooted with vfio_pci active and tried to start the VM, passing
through the SATA controller to it. As described in my previous messages,
the console showed an error message saying that it disabled INT 16 (!)
when starting the VM.

I looked into /proc/interrupts again and noticed that INT 16 was bound
to one of the USB ports, and that this was the only device using INT 16.

Then I added nointxmask=1 to vfio_pci's options, ran depmod, updated
the initramfs, and kept this setting for all further experiments.

After rebooting, I removed all "x-no-" options (the ones we talked
about recently) from the VM's device definitions. Then I unbound the USB
port in question (i.e. the one which used INT 16) from its driver.
Although lspci still claimed that this USB port was using INT 16,
/proc/interrupts showed that INT 16 was no longer bound to a driver.
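
(For reference, the unbind itself is a sysfs one-liner of this form -
the address and driver name here are placeholders, not my actual ones:

 echo 0000:00:1d.0 > /sys/bus/pci/drivers/uhci_hcd/unbind
)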

Then I started the VM. The console did not show any messages anymore,
the VM booted without any issue, *and SATA speed was back to normal*
(100 MB/s with nointxmask=1 and that USB port unbound, versus 2 MB/s
without nointxmask and without unbinding that USB port).

I have lost one USB port, but finally have full SATA hardware in the
VM. I can very well live with the lost USB port because there are plenty
of them, and it was USB 1.1 anyway. I will stick with this configuration
for now.

*And here is the interesting (from my naive point of view) part which
might explain what happened:*

/proc/interrupts (with the VM running!) shows that *vfio-intx is using
INT 16* now. KVM / QEMU obviously had the idea to assign INT 16 to the
vfio device *although* INT 16 was already bound to a USB port which was
active in the host, and *although* the device which is passed through
would be at INT 37 if vfio_pci were not active.
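
(Illustrative only - the relevant /proc/interrupts line looks roughly
like this; the counts here are made up:

 16:  12345  0  0  0  IO-APIC  16-fasteoi  vfio-intx(0000:02:00.0)
)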

Therefore, the console showed the error message regarding INT 16;
obviously, the kernel / KVM / QEMU could not handle the interrupt
sharing between the host USB port and the vfio_pci device which KVM /
QEMU had made necessary.

By the way, this is the only vfio_pci device on this machine.

Should we consider this behavior a bug? Why does a vfio_pci device get
bound to an interrupt which is already bound to another hardware device
on the host? Is there any way to influence that (modinfo vfio_pci does
not show any parameter related to interrupt numbers)?

>> - Could we expect your patch to go into upstream (perhaps after the
>> above issues / questions have been investigated)? I will try to convince
>> the Debian people to include the patch into 4.9; if they refuse, I will
>> have to compile a new kernel each time they release one, which happens
>> quite often (probably security fixes) since some time ...
> 
> I would not recommend trying to convince Debian to take a non-upstream
> patch, the process is that I need to do more research to figure out
> why this device isn't already quirked, I'm sure others have complained,
> but did similar patches make things worse for them or did they simply
> disappear.  Can you confirm whether the device behaves properly for
> host use with the patch?  Issues with assigning the device could be
> considered secondary if the host behavior is obviously improved.
> Alternatively, the 9230, or various others in that section of the
> quirk code, are already quirked, so you can decide if picking a
> different $30 card is a better option for you ;) Thanks,

I am not sure whether the interrupt conflict between the USB port and
vfio_pci is related to that chipset in particular. I guess (it's really
that: a guess) that KVM or QEMU do not assign an appropriate interrupt
number to vfio_pci devices under certain circumstances. If this is the
case, it could happen with other controllers / chipsets of all kinds as
well.

Thus, I assume we have that controller running now. If you are
interested, I will test it for a while and report back whether it is
stable; I would like to keep it passed through into the VM, though, so I
can't test whether it is stable for the host. However, if the latter is
a high-priority thing for you, I'll revert the configuration and let it
run in the host for a week or so.

Regards and many thanks,

Binarus

Patch

diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
index 98eba9127a0b..19ca3c9fac3a 100644
--- a/drivers/pci/quirks.c
+++ b/drivers/pci/quirks.c
@@ -3868,6 +3868,8 @@ static void quirk_dma_func1_alias(struct pci_dev *dev)
 /* https://bugzilla.kernel.org/show_bug.cgi?id=42679#c49 */
 DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_MARVELL_EXT, 0x9230,
 			 quirk_dma_func1_alias);
+DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_MARVELL_EXT, 0x9128,
+			 quirk_dma_func1_alias);
 DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_TTI, 0x0642,
 			 quirk_dma_func1_alias);
 /* https://bugs.gentoo.org/show_bug.cgi?id=497630 */