Message ID | 20230128013951.523247-1-damien.lemoal@opensource.wdc.com (mailing list archive) |
---|---|
State | Accepted |
Delegated to: | Bjorn Helgaas |
Series | pci: Avoid FLR for AMD FCH AHCI adapters |
[+cc Mario, Shyam, Brijesh]

On Sat, Jan 28, 2023 at 10:39:51AM +0900, Damien Le Moal wrote:
> PCI passthrough to VMs does not work with AMD FCH AHCI adapters: the
> guest OS fails to correctly probe devices attached to the controller due
> to FIS communication failures.

What does a FIS communication failure look like?  Can we include a
line or two of dmesg output here to help users find this fix?

AMD folks: Can you confirm/deny that this is a hardware erratum in
this device?  Do you know of any other devices that need a similar
workaround?

> Forcing the "bus" reset method before
> unbinding & binding the adapter to the vfio-pci driver solves this
> issue. I.e.:
>
>   echo "bus" > /sys/bus/pci/devices/<ID>/reset_method
>
> gives a working guest OS, thus indicating that the default flr reset
> method is defective, resulting in the adapter not being reset correctly.
>
> This patch applies the no_flr quirk to AMD FCH AHCI devices to
> permanently solve this issue.
>
> Reported-by: Niklas Cassel <niklas.cassel@wdc.com>
> Cc: stable@vger.kernel.org
> Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
> ---
>  drivers/pci/quirks.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
> index 285acc4aaccc..20ac67d59034 100644
> --- a/drivers/pci/quirks.c
> +++ b/drivers/pci/quirks.c
> @@ -5340,6 +5340,7 @@ static void quirk_no_flr(struct pci_dev *dev)
>  DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_AMD, 0x1487, quirk_no_flr);
>  DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_AMD, 0x148c, quirk_no_flr);
>  DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_AMD, 0x149c, quirk_no_flr);
> +DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_AMD, 0x7901, quirk_no_flr);
>  DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x1502, quirk_no_flr);
>  DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x1503, quirk_no_flr);
>
> --
> 2.39.1
>
On Mon, Jan 30, 2023 at 09:21:11AM -0600, Bjorn Helgaas wrote:
> [+cc Mario, Shyam, Brijesh]
>
> On Sat, Jan 28, 2023 at 10:39:51AM +0900, Damien Le Moal wrote:
> > PCI passthrough to VMs does not work with AMD FCH AHCI adapters: the
> > guest OS fails to correctly probe devices attached to the controller due
> > to FIS communication failures.
>
> What does a FIS communication failure look like?  Can we include a
> line or two of dmesg output here to help users find this fix?

Hello Bjorn,

It looks like this:

[   22.402368] ata4: softreset failed (1st FIS failed)
[   32.417855] ata4: softreset failed (1st FIS failed)
[   67.441641] ata4: softreset failed (1st FIS failed)
[   67.453227] ata4: limiting SATA link speed to 3.0 Gbps
[   72.661738] ata4: SATA link up 3.0 Gbps (SStatus 123 SControl 320)
[   78.121263] ata4.00: qc timeout after 5000 msecs (cmd 0xec)
[   78.134413] ata4.00: failed to IDENTIFY (I/O error, err_mask=0x4)

Basically, we can read and write MMIO registers in the AHCI HBA,
but the communication between the AHCI HBA and the ATA device does
not work properly.

(Because the AHCI HBA did not get reset when binding/unbinding the
device.)

The exact same kernel, using the same generic AHCI driver within the VM,
can communicate perfectly fine when using e.g. an Intel AHCI HBA.

(With both the AMD and Intel AHCI HBAs being bound to the vfio-pci driver
in the host.)

We can send a v2 with the above dmesg output.


Kind regards,
Niklas

>
> AMD folks: Can you confirm/deny that this is a hardware erratum in
> this device?  Do you know of any other devices that need a similar
> workaround?
>
> > Forcing the "bus" reset method before
> > unbinding & binding the adapter to the vfio-pci driver solves this
> > issue. I.e.:
> >
> >   echo "bus" > /sys/bus/pci/devices/<ID>/reset_method
> >
> > gives a working guest OS, thus indicating that the default flr reset
> > method is defective, resulting in the adapter not being reset correctly.
> >
> > This patch applies the no_flr quirk to AMD FCH AHCI devices to
> > permanently solve this issue.
> >
> > Reported-by: Niklas Cassel <niklas.cassel@wdc.com>
> > Cc: stable@vger.kernel.org
> > Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
> > ---
> >  drivers/pci/quirks.c | 1 +
> >  1 file changed, 1 insertion(+)
> >
> > diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
> > index 285acc4aaccc..20ac67d59034 100644
> > --- a/drivers/pci/quirks.c
> > +++ b/drivers/pci/quirks.c
> > @@ -5340,6 +5340,7 @@ static void quirk_no_flr(struct pci_dev *dev)
> >  DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_AMD, 0x1487, quirk_no_flr);
> >  DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_AMD, 0x148c, quirk_no_flr);
> >  DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_AMD, 0x149c, quirk_no_flr);
> > +DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_AMD, 0x7901, quirk_no_flr);
> >  DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x1502, quirk_no_flr);
> >  DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x1503, quirk_no_flr);
> >
> > --
> > 2.39.1
> >
On Mon, Jan 30, 2023 at 03:46:06PM +0000, Niklas Cassel wrote:
> On Mon, Jan 30, 2023 at 09:21:11AM -0600, Bjorn Helgaas wrote:
> > On Sat, Jan 28, 2023 at 10:39:51AM +0900, Damien Le Moal wrote:
> > > PCI passthrough to VMs does not work with AMD FCH AHCI adapters: the
> > > guest OS fails to correctly probe devices attached to the controller due
> > > to FIS communication failures.
> >
> > What does a FIS communication failure look like?  Can we include a
> > line or two of dmesg output here to help users find this fix?
>
> It looks like this:
>
> [   22.402368] ata4: softreset failed (1st FIS failed)
> [   32.417855] ata4: softreset failed (1st FIS failed)
> [   67.441641] ata4: softreset failed (1st FIS failed)
> [   67.453227] ata4: limiting SATA link speed to 3.0 Gbps
> [   72.661738] ata4: SATA link up 3.0 Gbps (SStatus 123 SControl 320)
> [   78.121263] ata4.00: qc timeout after 5000 msecs (cmd 0xec)
> [   78.134413] ata4.00: failed to IDENTIFY (I/O error, err_mask=0x4)
>
> Basically, we can read and write MMIO registers in the AHCI HBA,
> but the communication between the AHCI HBA and the ATA device does
> not work properly.
>
> (Because the AHCI HBA did not get reset when binding/unbinding the
> device.)
>
> The exact same kernel, using the same generic AHCI driver within the VM,
> can communicate perfectly fine when using e.g. an Intel AHCI HBA.
>
> (With both the AMD and Intel AHCI HBAs being bound to the vfio-pci driver
> in the host.)
>
> We can send a v2 with the above dmesg output.

Don't bother, I added the above and applied this to pci/virtualization
for v6.2, thanks!

> > AMD folks: Can you confirm/deny that this is a hardware erratum in
> > this device?  Do you know of any other devices that need a similar
> > workaround?
> >
> > > Forcing the "bus" reset method before
> > > unbinding & binding the adapter to the vfio-pci driver solves this
> > > issue. I.e.:
> > >
> > >   echo "bus" > /sys/bus/pci/devices/<ID>/reset_method
> > >
> > > gives a working guest OS, thus indicating that the default flr reset
> > > method is defective, resulting in the adapter not being reset correctly.
> > >
> > > This patch applies the no_flr quirk to AMD FCH AHCI devices to
> > > permanently solve this issue.
> > >
> > > Reported-by: Niklas Cassel <niklas.cassel@wdc.com>
> > > Cc: stable@vger.kernel.org
> > > Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
> > > ---
> > >  drivers/pci/quirks.c | 1 +
> > >  1 file changed, 1 insertion(+)
> > >
> > > diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
> > > index 285acc4aaccc..20ac67d59034 100644
> > > --- a/drivers/pci/quirks.c
> > > +++ b/drivers/pci/quirks.c
> > > @@ -5340,6 +5340,7 @@ static void quirk_no_flr(struct pci_dev *dev)
> > >  DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_AMD, 0x1487, quirk_no_flr);
> > >  DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_AMD, 0x148c, quirk_no_flr);
> > >  DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_AMD, 0x149c, quirk_no_flr);
> > > +DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_AMD, 0x7901, quirk_no_flr);
> > >  DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x1502, quirk_no_flr);
> > >  DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x1503, quirk_no_flr);
> > >
> > > --
> > > 2.39.1
> > >
On 1/30/2023 8:51 PM, Bjorn Helgaas wrote:
> [+cc Mario, Shyam, Brijesh]
>
> On Sat, Jan 28, 2023 at 10:39:51AM +0900, Damien Le Moal wrote:
>> PCI passthrough to VMs does not work with AMD FCH AHCI adapters: the
>> guest OS fails to correctly probe devices attached to the controller due
>> to FIS communication failures.
>
> What does a FIS communication failure look like?  Can we include a
> line or two of dmesg output here to help users find this fix?
>
> AMD folks: Can you confirm/deny that this is a hardware erratum in
> this device?  Do you know of any other devices that need a similar
> workaround?

Niklas, can you send the list of AHCI device id present on your system?
perhaps a lspci output?

Based on that I can talk to the FCH HW design folks to know if there is
erratum present for those device(s).

Thanks,
Shyam

>
>> Forcing the "bus" reset method before
>> unbinding & binding the adapter to the vfio-pci driver solves this
>> issue. I.e.:
>>
>>   echo "bus" > /sys/bus/pci/devices/<ID>/reset_method
>>
>> gives a working guest OS, thus indicating that the default flr reset
>> method is defective, resulting in the adapter not being reset correctly.
>>
>> This patch applies the no_flr quirk to AMD FCH AHCI devices to
>> permanently solve this issue.
>>
>> Reported-by: Niklas Cassel <niklas.cassel@wdc.com>
>> Cc: stable@vger.kernel.org
>> Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
>> ---
>>  drivers/pci/quirks.c | 1 +
>>  1 file changed, 1 insertion(+)
>>
>> diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
>> index 285acc4aaccc..20ac67d59034 100644
>> --- a/drivers/pci/quirks.c
>> +++ b/drivers/pci/quirks.c
>> @@ -5340,6 +5340,7 @@ static void quirk_no_flr(struct pci_dev *dev)
>>  DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_AMD, 0x1487, quirk_no_flr);
>>  DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_AMD, 0x148c, quirk_no_flr);
>>  DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_AMD, 0x149c, quirk_no_flr);
>> +DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_AMD, 0x7901, quirk_no_flr);
>>  DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x1502, quirk_no_flr);
>>  DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x1503, quirk_no_flr);
>>
>> --
>> 2.39.1
>>
On Mon, Jan 30, 2023 at 09:42:36PM +0530, Shyam Sundar S K wrote:
>
>
> On 1/30/2023 8:51 PM, Bjorn Helgaas wrote:
> > [+cc Mario, Shyam, Brijesh]
> >
> > On Sat, Jan 28, 2023 at 10:39:51AM +0900, Damien Le Moal wrote:
> >> PCI passthrough to VMs does not work with AMD FCH AHCI adapters: the
> >> guest OS fails to correctly probe devices attached to the controller due
> >> to FIS communication failures.
> >
> > What does a FIS communication failure look like?  Can we include a
> > line or two of dmesg output here to help users find this fix?
> >
> > AMD folks: Can you confirm/deny that this is a hardware erratum in
> > this device?  Do you know of any other devices that need a similar
> > workaround?
>
> Niklas, can you send the list of AHCI device id present on your system?
> perhaps a lspci output?

Of course, here you go:

# lspci -vvvnns 49:00.0
49:00.0 SATA controller [0106]: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] [1022:7901] (rev 51) (prog-if 01 [AHCI 1.0])
        Subsystem: Super Micro Computer Inc H12SSL-i [15d9:7901]
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0, Cache Line Size: 64 bytes
        Interrupt: pin A routed to IRQ 76
        IOMMU group: 48
        Region 5: Memory at b1500000 (32-bit, non-prefetchable) [size=2K]
        Capabilities: [48] Vendor Specific Information: Len=08 <?>
        Capabilities: [50] Power Management version 3
                Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot+,D3cold+)
                Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [64] Express (v2) Endpoint, MSI 00
                DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <4us, L1 unlimited
                        ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 0W
                DevCtl: CorrErr- NonFatalErr- FatalErr- UnsupReq-
                        RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+ FLReset-
                        MaxPayload 128 bytes, MaxReadReq 512 bytes
                DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
                LnkCap: Port #0, Speed 16GT/s, Width x16, ASPM L0s L1, Exit Latency L0s <64ns, L1 <1us
                        ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
                LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk+
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 16GT/s, Width x16
                        TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
                DevCap2: Completion Timeout: Range ABCD, TimeoutDis+ NROPrPrP- LTR-
                         10BitTagComp+ 10BitTagReq- OBFF Not Supported, ExtFmt- EETLPPrefix-
                         EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
                         FRS- TPHComp- ExtTPHComp-
                         AtomicOpsCap: 32bit- 64bit- 128bitCAS-
                DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR- 10BitTagReq- OBFF Disabled,
                         AtomicOpsCtl: ReqEn-
                LnkCap2: Supported Link Speeds: 2.5-16GT/s, Crosslink- Retimer+ 2Retimers+ DRS-
                LnkCtl2: Target Link Speed: 16GT/s, EnterCompliance- SpeedDis-
                         Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
                         Compliance Preset/De-emphasis: -6dB de-emphasis, 0dB preshoot
                LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete+ EqualizationPhase1+
                         EqualizationPhase2+ EqualizationPhase3+ LinkEqualizationRequest-
                         Retimer- 2Retimers- CrosslinkRes: unsupported
        Capabilities: [a0] MSI: Enable+ Count=16/16 Maskable- 64bit+
                Address: 00000000fee00000  Data: 0000
        Capabilities: [d0] SATA HBA v1.0 InCfgSpace
        Capabilities: [100 v1] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
        Capabilities: [150 v2] Advanced Error Reporting
                UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
                CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
                CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
                AERCap: First Error Pointer: 00, ECRCGenCap- ECRCGenEn- ECRCChkCap- ECRCChkEn-
                        MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
                HeaderLog: 00000000 00000000 00000000 00000000
        Capabilities: [270 v1] Secondary PCI Express
                LnkCtl3: LnkEquIntrruptEn- PerformEqu-
                LaneErrStat: 0
        Capabilities: [2a0 v1] Access Control Services
                ACSCap: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
                ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
        Capabilities: [400 v1] Data Link Feature <?>
        Capabilities: [410 v1] Physical Layer 16.0 GT/s <?>
        Capabilities: [440 v1] Lane Margining at the Receiver <?>
        Kernel driver in use: ahci


Kind regards,
Niklas
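One detail worth pointing out in the lspci output: DevCap reports FLReset+, i.e. the adapter advertises Function Level Reset, which is presumably why flr ends up as its default reset method even though the reset does not actually work. The following is only a rough user-space sketch of how the three FLR-related flags shown by lspci (DevCap FLReset, DevCtl FLReset, DevSta TransPend) map onto PCIe capability register bits. The mask values are meant to match PCI_EXP_DEVCAP_FLR, PCI_EXP_DEVCTL_BCR_FLR and PCI_EXP_DEVSTA_TRPND from the kernel's pci_regs.h; the struct and function names are made up for illustration and are not kernel or pciutils code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Masks intended to mirror the kernel's pci_regs.h values. */
#define DEVCAP_FLR      0x10000000u /* PCI_EXP_DEVCAP_FLR: function advertises FLR */
#define DEVCTL_BCR_FLR  0x8000u     /* PCI_EXP_DEVCTL_BCR_FLR: initiate FLR (endpoints) */
#define DEVSTA_TRPND    0x0020u     /* PCI_EXP_DEVSTA_TRPND: transactions pending */

/* Hypothetical helper struct: one field per FLR-related lspci flag. */
struct flr_flags {
    int flreset_cap; /* "FLReset+" in DevCap */
    int flreset_ctl; /* "FLReset+" in DevCtl */
    int trans_pend;  /* "TransPend+" in DevSta */
};

/* Decode raw DevCap/DevCtl/DevSta register values into the three flags. */
void decode_flr_flags(uint32_t devcap, uint16_t devctl, uint16_t devsta,
                      struct flr_flags *out)
{
    assert(out != NULL);
    out->flreset_cap = (devcap & DEVCAP_FLR) != 0;
    out->flreset_ctl = (devctl & DEVCTL_BCR_FLR) != 0;
    out->trans_pend  = (devsta & DEVSTA_TRPND) != 0;
}
```

For the adapter above, a DevCap value with bit 28 set and DevCtl/DevSta values with bits 15 and 5 clear would decode to FLReset+ / FLReset- / TransPend-, matching the listing. The capability bit only says what the device claims to support, not whether the reset is effective.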
On 1/31/23 01:04, Bjorn Helgaas wrote:
> On Mon, Jan 30, 2023 at 03:46:06PM +0000, Niklas Cassel wrote:
>> On Mon, Jan 30, 2023 at 09:21:11AM -0600, Bjorn Helgaas wrote:
>>> On Sat, Jan 28, 2023 at 10:39:51AM +0900, Damien Le Moal wrote:
>>>> PCI passthrough to VMs does not work with AMD FCH AHCI adapters: the
>>>> guest OS fails to correctly probe devices attached to the controller due
>>>> to FIS communication failures.
>>>
>>> What does a FIS communication failure look like?  Can we include a
>>> line or two of dmesg output here to help users find this fix?
>>
>> It looks like this:
>>
>> [   22.402368] ata4: softreset failed (1st FIS failed)
>> [   32.417855] ata4: softreset failed (1st FIS failed)
>> [   67.441641] ata4: softreset failed (1st FIS failed)
>> [   67.453227] ata4: limiting SATA link speed to 3.0 Gbps
>> [   72.661738] ata4: SATA link up 3.0 Gbps (SStatus 123 SControl 320)
>> [   78.121263] ata4.00: qc timeout after 5000 msecs (cmd 0xec)
>> [   78.134413] ata4.00: failed to IDENTIFY (I/O error, err_mask=0x4)
>>
>> Basically, we can read and write MMIO registers in the AHCI HBA,
>> but the communication between the AHCI HBA and the ATA device does
>> not work properly.
>>
>> (Because the AHCI HBA did not get reset when binding/unbinding the
>> device.)
>>
>> The exact same kernel, using the same generic AHCI driver within the VM,
>> can communicate perfectly fine when using e.g. an Intel AHCI HBA.
>>
>> (With both the AMD and Intel AHCI HBAs being bound to the vfio-pci driver
>> in the host.)
>>
>> We can send a v2 with the above dmesg output.
>
> Don't bother, I added the above and applied this to pci/virtualization
> for v6.2, thanks!

Thanks !
diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
index 285acc4aaccc..20ac67d59034 100644
--- a/drivers/pci/quirks.c
+++ b/drivers/pci/quirks.c
@@ -5340,6 +5340,7 @@ static void quirk_no_flr(struct pci_dev *dev)
 DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_AMD, 0x1487, quirk_no_flr);
 DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_AMD, 0x148c, quirk_no_flr);
 DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_AMD, 0x149c, quirk_no_flr);
+DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_AMD, 0x7901, quirk_no_flr);
 DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x1502, quirk_no_flr);
 DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x1503, quirk_no_flr);
PCI passthrough to VMs does not work with AMD FCH AHCI adapters: the
guest OS fails to correctly probe devices attached to the controller due
to FIS communication failures. Forcing the "bus" reset method before
unbinding & binding the adapter to the vfio-pci driver solves this
issue. I.e.:

  echo "bus" > /sys/bus/pci/devices/<ID>/reset_method

gives a working guest OS, thus indicating that the default flr reset
method is defective, resulting in the adapter not being reset correctly.

This patch applies the no_flr quirk to AMD FCH AHCI devices to
permanently solve this issue.

Reported-by: Niklas Cassel <niklas.cassel@wdc.com>
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
---
 drivers/pci/quirks.c | 1 +
 1 file changed, 1 insertion(+)
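For readers unfamiliar with how a no_flr quirk entry takes effect, here is a simplified user-space model, not kernel code. As far as I understand it, quirk_no_flr() sets the PCI_DEV_FLAGS_NO_FLR_RESET flag on a matching device and the FLR reset path then refuses to run for it, so a later reset method is used instead. The sketch below assumes only two methods (flr, then bus), whereas the real kernel tries a longer ordered list. All struct, enum and function names (model_dev, model_apply_fixups, model_pick_reset, ...) are hypothetical stand-ins; only the vendor/device IDs come from the diff.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define VENDOR_AMD   0x1022u /* PCI_VENDOR_ID_AMD */
#define VENDOR_INTEL 0x8086u /* PCI_VENDOR_ID_INTEL */

/* Models PCI_DEV_FLAGS_NO_FLR_RESET; the bit position here is arbitrary. */
#define DEV_FLAG_NO_FLR_RESET (1u << 0)

enum reset_method { RESET_FLR, RESET_BUS };

/* Hypothetical, heavily reduced stand-in for struct pci_dev. */
struct model_dev {
    uint16_t vendor;
    uint16_t device;
    int      has_flr_cap; /* DevCap advertises FLReset+ */
    unsigned flags;
};

struct model_fixup {
    uint16_t vendor;
    uint16_t device;
};

/* The quirk_no_flr entries visible in the diff, including the new 0x7901. */
static const struct model_fixup no_flr_table[] = {
    { VENDOR_AMD,   0x1487 },
    { VENDOR_AMD,   0x148c },
    { VENDOR_AMD,   0x149c },
    { VENDOR_AMD,   0x7901 },
    { VENDOR_INTEL, 0x1502 },
    { VENDOR_INTEL, 0x1503 },
};

/* Models the early fixup pass: flag every device listed in the table. */
void model_apply_fixups(struct model_dev *dev)
{
    size_t i;

    assert(dev != NULL);
    for (i = 0; i < sizeof(no_flr_table) / sizeof(no_flr_table[0]); i++) {
        if (dev->vendor == no_flr_table[i].vendor &&
            dev->device == no_flr_table[i].device)
            dev->flags |= DEV_FLAG_NO_FLR_RESET;
    }
}

/* Models reset selection: FLR only if advertised and not quirked off. */
enum reset_method model_pick_reset(const struct model_dev *dev)
{
    assert(dev != NULL);
    if (dev->has_flr_cap && !(dev->flags & DEV_FLAG_NO_FLR_RESET))
        return RESET_FLR;
    return RESET_BUS;
}
```

In this model, a 1022:7901 device with has_flr_cap set picks RESET_FLR before model_apply_fixups() and RESET_BUS afterwards, which corresponds to what the manual echo "bus" workaround achieves by hand.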