Message ID | 20210709145831.6123-3-verdre@v0yd.nl (mailing list archive) |
---|---|
State | Changes Requested |
Delegated to: | Kalle Valo |
Headers | show |
Series | mwifiex: Add quirks for MS Surface devices | expand |
On Friday 09 July 2021 16:58:31 Jonas Dreßler wrote: > From: Tsuchiya Yuto <kitakar@gmail.com> > > To reset mwifiex on Surface gen4+ (Pro 4 or later gen) devices, it > seems that putting the wifi device into D3cold is required according > to errata.inf file on Windows installation (Windows/INF/errata.inf). > > This patch adds a function that performs power-cycle (put into D3cold > then D0) and call the function at the end of reset_prepare(). > > Note: Need to also reset the parent device (bridge) of wifi on SB1; > it might be because the bridge of wifi always reports it's in D3hot. > When I tried to reset only the wifi device (not touching parent), it gave > the following error and the reset failed: > > acpi device:4b: Cannot transition to power state D0 for parent in D3hot > mwifiex_pcie 0000:03:00.0: can't change power state from D3cold to D0 (config space inaccessible) > > Signed-off-by: Tsuchiya Yuto <kitakar@gmail.com> > Signed-off-by: Jonas Dreßler <verdre@v0yd.nl> > --- > drivers/net/wireless/marvell/mwifiex/pcie.c | 7 + > .../wireless/marvell/mwifiex/pcie_quirks.c | 123 ++++++++++++++++++ > .../wireless/marvell/mwifiex/pcie_quirks.h | 3 + > 3 files changed, 133 insertions(+) > > diff --git a/drivers/net/wireless/marvell/mwifiex/pcie.c b/drivers/net/wireless/marvell/mwifiex/pcie.c > index a530832c9421..c6ccce426b49 100644 > --- a/drivers/net/wireless/marvell/mwifiex/pcie.c > +++ b/drivers/net/wireless/marvell/mwifiex/pcie.c > @@ -528,6 +528,13 @@ static void mwifiex_pcie_reset_prepare(struct pci_dev *pdev) > mwifiex_shutdown_sw(adapter); > clear_bit(MWIFIEX_IFACE_WORK_DEVICE_DUMP, &card->work_flags); > clear_bit(MWIFIEX_IFACE_WORK_CARD_RESET, &card->work_flags); > + > + /* On MS Surface gen4+ devices FLR isn't effective to recover from > + * hangups, so we power-cycle the card instead. > + */ > + if (card->quirks & QUIRK_FW_RST_D3COLD) > + mwifiex_pcie_reset_d3cold_quirk(pdev); > + Hello! Now I'm thinking loudly about this patch. Why this kind of reset is needed only for Surface devices? AFAIK these 88W8897 chips are same in all cards. Chip itself implements PCIe interface (and also SDIO) so for me looks very strange if this 88W8897 PCIe device needs DMI specific quirks. I cannot believe that Microsoft got some special version of these chips from Marvell which are different than version uses on cards in mPCIe form factor. And now when I'm reading comment below about PCIe bridge to which is this 88W8897 PCIe chip connected, is not this rather an issue in that PCIe bridge (instead of mwifiex/88W8897) or in ACPI firmware which controls this bridge? Or are having other people same issues on mPCIe form factor wifi cards with 88W8897 chips and then this quirk should not DMI dependent? Note that I'm seeing issues with reset and other things also on chip 88W8997 when is connected to system via SDIO. These chips have both PCIe and SDIO buses, it just depends which pins are used. > mwifiex_dbg(adapter, INFO, "%s, successful\n", __func__); > > card->pci_reset_ongoing = true; > diff --git a/drivers/net/wireless/marvell/mwifiex/pcie_quirks.c b/drivers/net/wireless/marvell/mwifiex/pcie_quirks.c > index 4064f99b36ba..b5f214fc1212 100644 > --- a/drivers/net/wireless/marvell/mwifiex/pcie_quirks.c > +++ b/drivers/net/wireless/marvell/mwifiex/pcie_quirks.c > @@ -15,6 +15,72 @@ > > /* quirk table based on DMI matching */ > static const struct dmi_system_id mwifiex_quirk_table[] = { > + { > + .ident = "Surface Pro 4", > + .matches = { > + DMI_EXACT_MATCH(DMI_SYS_VENDOR, "Microsoft Corporation"), > + DMI_EXACT_MATCH(DMI_PRODUCT_NAME, "Surface Pro 4"), > + }, > + .driver_data = (void *)QUIRK_FW_RST_D3COLD, > + }, > + { > + .ident = "Surface Pro 5", > + .matches = { > + /* match for SKU here due to generic product name "Surface Pro" */ > + DMI_EXACT_MATCH(DMI_SYS_VENDOR, "Microsoft Corporation"), > + DMI_EXACT_MATCH(DMI_PRODUCT_SKU, "Surface_Pro_1796"), > + }, > + .driver_data = (void *)QUIRK_FW_RST_D3COLD, > + }, > + { > + .ident = "Surface Pro 5 (LTE)", > + .matches = { > + /* match for SKU here due to generic product name "Surface Pro" */ > + DMI_EXACT_MATCH(DMI_SYS_VENDOR, "Microsoft Corporation"), > + DMI_EXACT_MATCH(DMI_PRODUCT_SKU, "Surface_Pro_1807"), > + }, > + .driver_data = (void *)QUIRK_FW_RST_D3COLD, > + }, > + { > + .ident = "Surface Pro 6", > + .matches = { > + DMI_EXACT_MATCH(DMI_SYS_VENDOR, "Microsoft Corporation"), > + DMI_EXACT_MATCH(DMI_PRODUCT_NAME, "Surface Pro 6"), > + }, > + .driver_data = (void *)QUIRK_FW_RST_D3COLD, > + }, > + { > + .ident = "Surface Book 1", > + .matches = { > + DMI_EXACT_MATCH(DMI_SYS_VENDOR, "Microsoft Corporation"), > + DMI_EXACT_MATCH(DMI_PRODUCT_NAME, "Surface Book"), > + }, > + .driver_data = (void *)QUIRK_FW_RST_D3COLD, > + }, > + { > + .ident = "Surface Book 2", > + .matches = { > + DMI_EXACT_MATCH(DMI_SYS_VENDOR, "Microsoft Corporation"), > + DMI_EXACT_MATCH(DMI_PRODUCT_NAME, "Surface Book 2"), > + }, > + .driver_data = (void *)QUIRK_FW_RST_D3COLD, > + }, > + { > + .ident = "Surface Laptop 1", > + .matches = { > + DMI_EXACT_MATCH(DMI_SYS_VENDOR, "Microsoft Corporation"), > + DMI_EXACT_MATCH(DMI_PRODUCT_NAME, "Surface Laptop"), > + }, > + .driver_data = (void *)QUIRK_FW_RST_D3COLD, > + }, > + { > + .ident = "Surface Laptop 2", > + .matches = { > + DMI_EXACT_MATCH(DMI_SYS_VENDOR, "Microsoft Corporation"), > + DMI_EXACT_MATCH(DMI_PRODUCT_NAME, "Surface Laptop 2"), > + }, > + .driver_data = (void *)QUIRK_FW_RST_D3COLD, > + }, > {} > }; > > @@ -29,4 +95,61 @@ void mwifiex_initialize_quirks(struct pcie_service_card *card) > > if (!card->quirks) > dev_info(&pdev->dev, "no quirks enabled\n"); > + if (card->quirks & QUIRK_FW_RST_D3COLD) > + dev_info(&pdev->dev, "quirk reset_d3cold enabled\n"); > +} > + > +static void mwifiex_pcie_set_power_d3cold(struct pci_dev *pdev) > +{ > + dev_info(&pdev->dev, "putting into D3cold...\n"); > + > + pci_save_state(pdev); > + if (pci_is_enabled(pdev)) > + pci_disable_device(pdev); > + pci_set_power_state(pdev, PCI_D3cold); > +} > + > +static int mwifiex_pcie_set_power_d0(struct pci_dev *pdev) > +{ > + int ret; > + > + dev_info(&pdev->dev, "putting into D0...\n"); > + > + pci_set_power_state(pdev, PCI_D0); > + ret = pci_enable_device(pdev); > + if (ret) { > + dev_err(&pdev->dev, "pci_enable_device failed\n"); > + return ret; > + } > + pci_restore_state(pdev); > + > + return 0; > +} > + > +int mwifiex_pcie_reset_d3cold_quirk(struct pci_dev *pdev) > +{ > + struct pci_dev *parent_pdev = pci_upstream_bridge(pdev); > + int ret; > + > + /* Power-cycle (put into D3cold then D0) */ > + dev_info(&pdev->dev, "Using reset_d3cold quirk to perform FW reset\n"); > + > + /* We need to perform power-cycle also for bridge of wifi because > + * on some devices (e.g. Surface Book 1), the OS for some reasons > + * can't know the real power state of the bridge. > + * When tried to power-cycle only wifi, the reset failed with the > + * following dmesg log: > + * "Cannot transition to power state D0 for parent in D3hot". > + */ > + mwifiex_pcie_set_power_d3cold(pdev); > + mwifiex_pcie_set_power_d3cold(parent_pdev); > + > + ret = mwifiex_pcie_set_power_d0(parent_pdev); > + if (ret) > + return ret; > + ret = mwifiex_pcie_set_power_d0(pdev); > + if (ret) > + return ret; > + > + return 0; > } > diff --git a/drivers/net/wireless/marvell/mwifiex/pcie_quirks.h b/drivers/net/wireless/marvell/mwifiex/pcie_quirks.h > index 7a1fe3b3a61a..549093067813 100644 > --- a/drivers/net/wireless/marvell/mwifiex/pcie_quirks.h > +++ b/drivers/net/wireless/marvell/mwifiex/pcie_quirks.h > @@ -5,4 +5,7 @@ > > #include "pcie.h" > > +#define QUIRK_FW_RST_D3COLD BIT(0) > + > void mwifiex_initialize_quirks(struct pcie_service_card *card); > +int mwifiex_pcie_reset_d3cold_quirk(struct pci_dev *pdev); > -- > 2.31.1 >
On 7/9/21 5:18 PM, Pali Rohár wrote: > On Friday 09 July 2021 16:58:31 Jonas Dreßler wrote: >> From: Tsuchiya Yuto <kitakar@gmail.com> >> >> To reset mwifiex on Surface gen4+ (Pro 4 or later gen) devices, it >> seems that putting the wifi device into D3cold is required according >> to errata.inf file on Windows installation (Windows/INF/errata.inf). >> >> This patch adds a function that performs power-cycle (put into D3cold >> then D0) and call the function at the end of reset_prepare(). >> >> Note: Need to also reset the parent device (bridge) of wifi on SB1; >> it might be because the bridge of wifi always reports it's in D3hot. >> When I tried to reset only the wifi device (not touching parent), it gave >> the following error and the reset failed: >> >> acpi device:4b: Cannot transition to power state D0 for parent in D3hot >> mwifiex_pcie 0000:03:00.0: can't change power state from D3cold to D0 (config space inaccessible) >> >> Signed-off-by: Tsuchiya Yuto <kitakar@gmail.com> >> Signed-off-by: Jonas Dreßler <verdre@v0yd.nl> >> --- >> drivers/net/wireless/marvell/mwifiex/pcie.c | 7 + >> .../wireless/marvell/mwifiex/pcie_quirks.c | 123 ++++++++++++++++++ >> .../wireless/marvell/mwifiex/pcie_quirks.h | 3 + >> 3 files changed, 133 insertions(+) >> >> diff --git a/drivers/net/wireless/marvell/mwifiex/pcie.c b/drivers/net/wireless/marvell/mwifiex/pcie.c >> index a530832c9421..c6ccce426b49 100644 >> --- a/drivers/net/wireless/marvell/mwifiex/pcie.c >> +++ b/drivers/net/wireless/marvell/mwifiex/pcie.c >> @@ -528,6 +528,13 @@ static void mwifiex_pcie_reset_prepare(struct pci_dev *pdev) >> mwifiex_shutdown_sw(adapter); >> clear_bit(MWIFIEX_IFACE_WORK_DEVICE_DUMP, &card->work_flags); >> clear_bit(MWIFIEX_IFACE_WORK_CARD_RESET, &card->work_flags); >> + >> + /* On MS Surface gen4+ devices FLR isn't effective to recover from >> + * hangups, so we power-cycle the card instead. >> + */ >> + if (card->quirks & QUIRK_FW_RST_D3COLD) >> + mwifiex_pcie_reset_d3cold_quirk(pdev); >> + > > Hello! Now I'm thinking loudly about this patch. Why this kind of reset > is needed only for Surface devices? AFAIK these 88W8897 chips are same > in all cards. Chip itself implements PCIe interface (and also SDIO) so > for me looks very strange if this 88W8897 PCIe device needs DMI specific > quirks. I cannot believe that Microsoft got some special version of > these chips from Marvell which are different than version uses on cards > in mPCIe form factor. > > And now when I'm reading comment below about PCIe bridge to which is > this 88W8897 PCIe chip connected, is not this rather an issue in that > PCIe bridge (instead of mwifiex/88W8897) or in ACPI firmware which > controls this bridge? > > Or are having other people same issues on mPCIe form factor wifi cards > with 88W8897 chips and then this quirk should not DMI dependent? > > Note that I'm seeing issues with reset and other things also on chip > 88W8997 when is connected to system via SDIO. These chips have both PCIe > and SDIO buses, it just depends which pins are used. > Hi and thanks for the quick reply! Honestly I've no idea, this is just the first method we found that allows for a proper reset of the chip. What I know is that some Surface devices need that ACPI DSM call (the one that was done in the commit I dropped in this version of the patchset) to reset the chip instead of this method. Afaik other devices with this chip don't need this resetting method, at least Marvell employees couldn't reproduce the issues on their testing devices. So would you suggest we just try to match for the pci chip 88W8897 instead? Then we'd probably have to check if there are any laptops where multiple devices are connected to the pci bridge as Amey suggested in a review before. >> mwifiex_dbg(adapter, INFO, "%s, successful\n", __func__); >> >> card->pci_reset_ongoing = true; >> diff --git a/drivers/net/wireless/marvell/mwifiex/pcie_quirks.c b/drivers/net/wireless/marvell/mwifiex/pcie_quirks.c >> index 4064f99b36ba..b5f214fc1212 100644 >> --- a/drivers/net/wireless/marvell/mwifiex/pcie_quirks.c >> +++ b/drivers/net/wireless/marvell/mwifiex/pcie_quirks.c >> @@ -15,6 +15,72 @@ >> >> /* quirk table based on DMI matching */ >> static const struct dmi_system_id mwifiex_quirk_table[] = { >> + { >> + .ident = "Surface Pro 4", >> + .matches = { >> + DMI_EXACT_MATCH(DMI_SYS_VENDOR, "Microsoft Corporation"), >> + DMI_EXACT_MATCH(DMI_PRODUCT_NAME, "Surface Pro 4"), >> + }, >> + .driver_data = (void *)QUIRK_FW_RST_D3COLD, >> + }, >> + { >> + .ident = "Surface Pro 5", >> + .matches = { >> + /* match for SKU here due to generic product name "Surface Pro" */ >> + DMI_EXACT_MATCH(DMI_SYS_VENDOR, "Microsoft Corporation"), >> + DMI_EXACT_MATCH(DMI_PRODUCT_SKU, "Surface_Pro_1796"), >> + }, >> + .driver_data = (void *)QUIRK_FW_RST_D3COLD, >> + }, >> + { >> + .ident = "Surface Pro 5 (LTE)", >> + .matches = { >> + /* match for SKU here due to generic product name "Surface Pro" */ >> + DMI_EXACT_MATCH(DMI_SYS_VENDOR, "Microsoft Corporation"), >> + DMI_EXACT_MATCH(DMI_PRODUCT_SKU, "Surface_Pro_1807"), >> + }, >> + .driver_data = (void *)QUIRK_FW_RST_D3COLD, >> + }, >> + { >> + .ident = "Surface Pro 6", >> + .matches = { >> + DMI_EXACT_MATCH(DMI_SYS_VENDOR, "Microsoft Corporation"), >> + DMI_EXACT_MATCH(DMI_PRODUCT_NAME, "Surface Pro 6"), >> + }, >> + .driver_data = (void *)QUIRK_FW_RST_D3COLD, >> + }, >> + { >> + .ident = "Surface Book 1", >> + .matches = { >> + DMI_EXACT_MATCH(DMI_SYS_VENDOR, "Microsoft Corporation"), >> + DMI_EXACT_MATCH(DMI_PRODUCT_NAME, "Surface Book"), >> + }, >> + .driver_data = (void *)QUIRK_FW_RST_D3COLD, >> + }, >> + { >> + .ident = "Surface Book 2", >> + .matches = { >> + DMI_EXACT_MATCH(DMI_SYS_VENDOR, "Microsoft Corporation"), >> + DMI_EXACT_MATCH(DMI_PRODUCT_NAME, "Surface Book 2"), >> + }, >> + .driver_data = (void *)QUIRK_FW_RST_D3COLD, >> + }, >> + { >> + .ident = "Surface Laptop 1", >> + .matches = { >> + DMI_EXACT_MATCH(DMI_SYS_VENDOR, "Microsoft Corporation"), >> + DMI_EXACT_MATCH(DMI_PRODUCT_NAME, "Surface Laptop"), >> + }, >> + .driver_data = (void *)QUIRK_FW_RST_D3COLD, >> + }, >> + { >> + .ident = "Surface Laptop 2", >> + .matches = { >> + DMI_EXACT_MATCH(DMI_SYS_VENDOR, "Microsoft Corporation"), >> + DMI_EXACT_MATCH(DMI_PRODUCT_NAME, "Surface Laptop 2"), >> + }, >> + .driver_data = (void *)QUIRK_FW_RST_D3COLD, >> + }, >> {} >> }; >> >> @@ -29,4 +95,61 @@ void mwifiex_initialize_quirks(struct pcie_service_card *card) >> >> if (!card->quirks) >> dev_info(&pdev->dev, "no quirks enabled\n"); >> + if (card->quirks & QUIRK_FW_RST_D3COLD) >> + dev_info(&pdev->dev, "quirk reset_d3cold enabled\n"); >> +} >> + >> +static void mwifiex_pcie_set_power_d3cold(struct pci_dev *pdev) >> +{ >> + dev_info(&pdev->dev, "putting into D3cold...\n"); >> + >> + pci_save_state(pdev); >> + if (pci_is_enabled(pdev)) >> + pci_disable_device(pdev); >> + pci_set_power_state(pdev, PCI_D3cold); >> +} >> + >> +static int mwifiex_pcie_set_power_d0(struct pci_dev *pdev) >> +{ >> + int ret; >> + >> + dev_info(&pdev->dev, "putting into D0...\n"); >> + >> + pci_set_power_state(pdev, PCI_D0); >> + ret = pci_enable_device(pdev); >> + if (ret) { >> + dev_err(&pdev->dev, "pci_enable_device failed\n"); >> + return ret; >> + } >> + pci_restore_state(pdev); >> + >> + return 0; >> +} >> + >> +int mwifiex_pcie_reset_d3cold_quirk(struct pci_dev *pdev) >> +{ >> + struct pci_dev *parent_pdev = pci_upstream_bridge(pdev); >> + int ret; >> + >> + /* Power-cycle (put into D3cold then D0) */ >> + dev_info(&pdev->dev, "Using reset_d3cold quirk to perform FW reset\n"); >> + >> + /* We need to perform power-cycle also for bridge of wifi because >> + * on some devices (e.g. Surface Book 1), the OS for some reasons >> + * can't know the real power state of the bridge. >> + * When tried to power-cycle only wifi, the reset failed with the >> + * following dmesg log: >> + * "Cannot transition to power state D0 for parent in D3hot". >> + */ >> + mwifiex_pcie_set_power_d3cold(pdev); >> + mwifiex_pcie_set_power_d3cold(parent_pdev); >> + >> + ret = mwifiex_pcie_set_power_d0(parent_pdev); >> + if (ret) >> + return ret; >> + ret = mwifiex_pcie_set_power_d0(pdev); >> + if (ret) >> + return ret; >> + >> + return 0; >> } >> diff --git a/drivers/net/wireless/marvell/mwifiex/pcie_quirks.h b/drivers/net/wireless/marvell/mwifiex/pcie_quirks.h >> index 7a1fe3b3a61a..549093067813 100644 >> --- a/drivers/net/wireless/marvell/mwifiex/pcie_quirks.h >> +++ b/drivers/net/wireless/marvell/mwifiex/pcie_quirks.h >> @@ -5,4 +5,7 @@ >> >> #include "pcie.h" >> >> +#define QUIRK_FW_RST_D3COLD BIT(0) >> + >> void mwifiex_initialize_quirks(struct pcie_service_card *card); >> +int mwifiex_pcie_reset_d3cold_quirk(struct pci_dev *pdev); >> -- >> 2.31.1 >>
On Fri, Jul 9, 2021 at 6:18 PM Pali Rohár <pali@kernel.org> wrote: > On Friday 09 July 2021 16:58:31 Jonas Dreßler wrote: > Hello! Now I'm thinking loudly about this patch. Why this kind of reset > is needed only for Surface devices? AFAIK these 88W8897 chips are same > in all cards. Chip itself implements PCIe interface (and also SDIO) so > for me looks very strange if this 88W8897 PCIe device needs DMI specific > quirks. I cannot believe that Microsoft got some special version of > these chips from Marvell which are different than version uses on cards > in mPCIe form factor. > > And now when I'm reading comment below about PCIe bridge to which is > this 88W8897 PCIe chip connected, is not this rather an issue in that > PCIe bridge (instead of mwifiex/88W8897) or in ACPI firmware which > controls this bridge? > > Or are having other people same issues on mPCIe form factor wifi cards > with 88W8897 chips and then this quirk should not DMI dependent? > > Note that I'm seeing issues with reset and other things also on chip > 88W8997 when is connected to system via SDIO. These chips have both PCIe > and SDIO buses, it just depends which pins are used. I'm replying loudly :-) You know that depending on the interface the firmware even for the same chip may be way different. And if you have had any experience working in product companies you should know well that bug in product X is not gonna be fixed if it was not reported, but gets fixed on product Y due to that. Besides that, how do you know that MS has not been given the special edition of the FW? As icing on the cake, the Marvell has been bought and I believe they abandoned their products quite a while ago. You may read kernel bugzilla for the details (last Marvell developer who answered to the reports seems has no clue about the driver). All that said, I believe that we have to follow whatever fixes we would have and be thankful to contributors and testers. For the record, I've been suffering from the Linux driver of this hardware for a while. And I'm fully in support of any quirks that will help UX.
On Friday 09 July 2021 17:33:34 Jonas Dreßler wrote: > On 7/9/21 5:18 PM, Pali Rohár wrote: > > On Friday 09 July 2021 16:58:31 Jonas Dreßler wrote: > > > From: Tsuchiya Yuto <kitakar@gmail.com> > > > > > > To reset mwifiex on Surface gen4+ (Pro 4 or later gen) devices, it > > > seems that putting the wifi device into D3cold is required according > > > to errata.inf file on Windows installation (Windows/INF/errata.inf). > > > > > > This patch adds a function that performs power-cycle (put into D3cold > > > then D0) and call the function at the end of reset_prepare(). > > > > > > Note: Need to also reset the parent device (bridge) of wifi on SB1; > > > it might be because the bridge of wifi always reports it's in D3hot. > > > When I tried to reset only the wifi device (not touching parent), it gave > > > the following error and the reset failed: > > > > > > acpi device:4b: Cannot transition to power state D0 for parent in D3hot > > > mwifiex_pcie 0000:03:00.0: can't change power state from D3cold to D0 (config space inaccessible) > > > > > > Signed-off-by: Tsuchiya Yuto <kitakar@gmail.com> > > > Signed-off-by: Jonas Dreßler <verdre@v0yd.nl> > > > --- > > > drivers/net/wireless/marvell/mwifiex/pcie.c | 7 + > > > .../wireless/marvell/mwifiex/pcie_quirks.c | 123 ++++++++++++++++++ > > > .../wireless/marvell/mwifiex/pcie_quirks.h | 3 + > > > 3 files changed, 133 insertions(+) > > > > > > diff --git a/drivers/net/wireless/marvell/mwifiex/pcie.c b/drivers/net/wireless/marvell/mwifiex/pcie.c > > > index a530832c9421..c6ccce426b49 100644 > > > --- a/drivers/net/wireless/marvell/mwifiex/pcie.c > > > +++ b/drivers/net/wireless/marvell/mwifiex/pcie.c > > > @@ -528,6 +528,13 @@ static void mwifiex_pcie_reset_prepare(struct pci_dev *pdev) > > > mwifiex_shutdown_sw(adapter); > > > clear_bit(MWIFIEX_IFACE_WORK_DEVICE_DUMP, &card->work_flags); > > > clear_bit(MWIFIEX_IFACE_WORK_CARD_RESET, &card->work_flags); > > > + > > > + /* On MS Surface gen4+ devices FLR isn't effective to recover from > > > + * hangups, so we power-cycle the card instead. > > > + */ > > > + if (card->quirks & QUIRK_FW_RST_D3COLD) > > > + mwifiex_pcie_reset_d3cold_quirk(pdev); > > > + > > > > Hello! Now I'm thinking loudly about this patch. Why this kind of reset > > is needed only for Surface devices? AFAIK these 88W8897 chips are same > > in all cards. Chip itself implements PCIe interface (and also SDIO) so > > for me looks very strange if this 88W8897 PCIe device needs DMI specific > > quirks. I cannot believe that Microsoft got some special version of > > these chips from Marvell which are different than version uses on cards > > in mPCIe form factor. > > > > And now when I'm reading comment below about PCIe bridge to which is > > this 88W8897 PCIe chip connected, is not this rather an issue in that > > PCIe bridge (instead of mwifiex/88W8897) or in ACPI firmware which > > controls this bridge? > > > > Or are having other people same issues on mPCIe form factor wifi cards > > with 88W8897 chips and then this quirk should not DMI dependent? > > > > Note that I'm seeing issues with reset and other things also on chip > > 88W8997 when is connected to system via SDIO. These chips have both PCIe > > and SDIO buses, it just depends which pins are used. > > > > Hi and thanks for the quick reply! Honestly I've no idea, this is just the > first method we found that allows for a proper reset of the chip. What I > know is that some Surface devices need that ACPI DSM call (the one that was > done in the commit I dropped in this version of the patchset) to reset the > chip instead of this method. > > Afaik other devices with this chip don't need this resetting method, at > least Marvell employees couldn't reproduce the issues on their testing > devices. > > So would you suggest we just try to match for the pci chip 88W8897 instead? Hello! Such suggestion makes sense when we know that it is 88W8897 issue. But if you got information that issue cannot be reproduced on other 88W8897 cards then matching 88W8897 is not correct. From all this information looks like that it is problem in (Microsoft?) PCIe bridge to which is card connected. Otherwise I do not reason how it can be 88W8897 affected. Either it is reproducible on 88W8897 cards also in other devices or issue is not on 88W8897 card. > Then we'd probably have to check if there are any laptops where multiple > devices are connected to the pci bridge as Amey suggested in a review > before. Well, I do not know... But if this is issue with PCIe bridge then similar issue could be observed also for other PCIe devices with this PCIe bridge. But question is if there are other laptops with this PCIe bridge. And also it can be a problem in ACPI firmware on those Surface devices, which implements some PCIe bridge functionality. So it is possible that issue is with PCIe bridge, not in HW, but in SW/firmware part which can be Microsoft specific... So too many questions to which we do not know answers. Could you provide output of 'lspci -nn -vv' and 'lspci -tvnn' on affected machines? If you have already sent it in some previous email, just send a link. At least I'm not able to find it right now and output may contain something useful... > > > mwifiex_dbg(adapter, INFO, "%s, successful\n", __func__); > > > card->pci_reset_ongoing = true; > > > diff --git a/drivers/net/wireless/marvell/mwifiex/pcie_quirks.c b/drivers/net/wireless/marvell/mwifiex/pcie_quirks.c > > > index 4064f99b36ba..b5f214fc1212 100644 > > > --- a/drivers/net/wireless/marvell/mwifiex/pcie_quirks.c > > > +++ b/drivers/net/wireless/marvell/mwifiex/pcie_quirks.c > > > @@ -15,6 +15,72 @@ > > > /* quirk table based on DMI matching */ > > > static const struct dmi_system_id mwifiex_quirk_table[] = { > > > + { > > > + .ident = "Surface Pro 4", > > > + .matches = { > > > + DMI_EXACT_MATCH(DMI_SYS_VENDOR, "Microsoft Corporation"), > > > + DMI_EXACT_MATCH(DMI_PRODUCT_NAME, "Surface Pro 4"), > > > + }, > > > + .driver_data = (void *)QUIRK_FW_RST_D3COLD, > > > + }, > > > + { > > > + .ident = "Surface Pro 5", > > > + .matches = { > > > + /* match for SKU here due to generic product name "Surface Pro" */ > > > + DMI_EXACT_MATCH(DMI_SYS_VENDOR, "Microsoft Corporation"), > > > + DMI_EXACT_MATCH(DMI_PRODUCT_SKU, "Surface_Pro_1796"), > > > + }, > > > + .driver_data = (void *)QUIRK_FW_RST_D3COLD, > > > + }, > > > + { > > > + .ident = "Surface Pro 5 (LTE)", > > > + .matches = { > > > + /* match for SKU here due to generic product name "Surface Pro" */ > > > + DMI_EXACT_MATCH(DMI_SYS_VENDOR, "Microsoft Corporation"), > > > + DMI_EXACT_MATCH(DMI_PRODUCT_SKU, "Surface_Pro_1807"), > > > + }, > > > + .driver_data = (void *)QUIRK_FW_RST_D3COLD, > > > + }, > > > + { > > > + .ident = "Surface Pro 6", > > > + .matches = { > > > + DMI_EXACT_MATCH(DMI_SYS_VENDOR, "Microsoft Corporation"), > > > + DMI_EXACT_MATCH(DMI_PRODUCT_NAME, "Surface Pro 6"), > > > + }, > > > + .driver_data = (void *)QUIRK_FW_RST_D3COLD, > > > + }, > > > + { > > > + .ident = "Surface Book 1", > > > + .matches = { > > > + DMI_EXACT_MATCH(DMI_SYS_VENDOR, "Microsoft Corporation"), > > > + DMI_EXACT_MATCH(DMI_PRODUCT_NAME, "Surface Book"), > > > + }, > > > + .driver_data = (void *)QUIRK_FW_RST_D3COLD, > > > + }, > > > + { > > > + .ident = "Surface Book 2", > > > + .matches = { > > > + DMI_EXACT_MATCH(DMI_SYS_VENDOR, "Microsoft Corporation"), > > > + DMI_EXACT_MATCH(DMI_PRODUCT_NAME, "Surface Book 2"), > > > + }, > > > + .driver_data = (void *)QUIRK_FW_RST_D3COLD, > > > + }, > > > + { > > > + .ident = "Surface Laptop 1", > > > + .matches = { > > > + DMI_EXACT_MATCH(DMI_SYS_VENDOR, "Microsoft Corporation"), > > > + DMI_EXACT_MATCH(DMI_PRODUCT_NAME, "Surface Laptop"), > > > + }, > > > + .driver_data = (void *)QUIRK_FW_RST_D3COLD, > > > + }, > > > + { > > > + .ident = "Surface Laptop 2", > > > + .matches = { > > > + DMI_EXACT_MATCH(DMI_SYS_VENDOR, "Microsoft Corporation"), > > > + DMI_EXACT_MATCH(DMI_PRODUCT_NAME, "Surface Laptop 2"), > > > + }, > > > + .driver_data = (void *)QUIRK_FW_RST_D3COLD, > > > + }, > > > {} > > > }; > > > @@ -29,4 +95,61 @@ void mwifiex_initialize_quirks(struct pcie_service_card *card) > > > if (!card->quirks) > > > dev_info(&pdev->dev, "no quirks enabled\n"); > > > + if (card->quirks & QUIRK_FW_RST_D3COLD) > > > + dev_info(&pdev->dev, "quirk reset_d3cold enabled\n"); > > > +} > > > + > > > +static void mwifiex_pcie_set_power_d3cold(struct pci_dev *pdev) > > > +{ > > > + dev_info(&pdev->dev, "putting into D3cold...\n"); > > > + > > > + pci_save_state(pdev); > > > + if (pci_is_enabled(pdev)) > > > + pci_disable_device(pdev); > > > + pci_set_power_state(pdev, PCI_D3cold); > > > +} > > > + > > > +static int mwifiex_pcie_set_power_d0(struct pci_dev *pdev) > > > +{ > > > + int ret; > > > + > > > + dev_info(&pdev->dev, "putting into D0...\n"); > > > + > > > + pci_set_power_state(pdev, PCI_D0); > > > + ret = pci_enable_device(pdev); > > > + if (ret) { > > > + dev_err(&pdev->dev, "pci_enable_device failed\n"); > > > + return ret; > > > + } > > > + pci_restore_state(pdev); > > > + > > > + return 0; > > > +} > > > + > > > +int mwifiex_pcie_reset_d3cold_quirk(struct pci_dev *pdev) > > > +{ > > > + struct pci_dev *parent_pdev = pci_upstream_bridge(pdev); > > > + int ret; > > > + > > > + /* Power-cycle (put into D3cold then D0) */ > > > + dev_info(&pdev->dev, "Using reset_d3cold quirk to perform FW reset\n"); > > > + > > > + /* We need to perform power-cycle also for bridge of wifi because > > > + * on some devices (e.g. Surface Book 1), the OS for some reasons > > > + * can't know the real power state of the bridge. > > > + * When tried to power-cycle only wifi, the reset failed with the > > > + * following dmesg log: > > > + * "Cannot transition to power state D0 for parent in D3hot". > > > + */ > > > + mwifiex_pcie_set_power_d3cold(pdev); > > > + mwifiex_pcie_set_power_d3cold(parent_pdev); > > > + > > > + ret = mwifiex_pcie_set_power_d0(parent_pdev); > > > + if (ret) > > > + return ret; > > > + ret = mwifiex_pcie_set_power_d0(pdev); > > > + if (ret) > > > + return ret; > > > + > > > + return 0; > > > } > > > diff --git a/drivers/net/wireless/marvell/mwifiex/pcie_quirks.h b/drivers/net/wireless/marvell/mwifiex/pcie_quirks.h > > > index 7a1fe3b3a61a..549093067813 100644 > > > --- a/drivers/net/wireless/marvell/mwifiex/pcie_quirks.h > > > +++ b/drivers/net/wireless/marvell/mwifiex/pcie_quirks.h > > > @@ -5,4 +5,7 @@ > > > #include "pcie.h" > > > +#define QUIRK_FW_RST_D3COLD BIT(0) > > > + > > > void mwifiex_initialize_quirks(struct pcie_service_card *card); > > > +int mwifiex_pcie_reset_d3cold_quirk(struct pci_dev *pdev); > > > -- > > > 2.31.1 > > > >
On Friday 09 July 2021 19:01:44 Andy Shevchenko wrote: > On Fri, Jul 9, 2021 at 6:18 PM Pali Rohár <pali@kernel.org> wrote: > > On Friday 09 July 2021 16:58:31 Jonas Dreßler wrote: > > > > Hello! Now I'm thinking loudly about this patch. Why this kind of reset > > is needed only for Surface devices? AFAIK these 88W8897 chips are same > > in all cards. Chip itself implements PCIe interface (and also SDIO) so > > for me looks very strange if this 88W8897 PCIe device needs DMI specific > > quirks. I cannot believe that Microsoft got some special version of > > these chips from Marvell which are different than version uses on cards > > in mPCIe form factor. > > > > And now when I'm reading comment below about PCIe bridge to which is > > this 88W8897 PCIe chip connected, is not this rather an issue in that > > PCIe bridge (instead of mwifiex/88W8897) or in ACPI firmware which > > controls this bridge? > > > > Or are having other people same issues on mPCIe form factor wifi cards > > with 88W8897 chips and then this quirk should not DMI dependent? > > > > Note that I'm seeing issues with reset and other things also on chip > > 88W8997 when is connected to system via SDIO. These chips have both PCIe > > and SDIO buses, it just depends which pins are used. > > I'm replying loudly :-) > > You know that depending on the interface the firmware even for the > same chip may be way different. And if you have had any experience > working in product companies you should know well that bug in product > X is not gonna be fixed if it was not reported, but gets fixed on > product Y due to that. Besides that, how do you know that MS has not > been given the special edition of the FW? Yes! But I know something about these chips/cards (I have also one development kit) and it is quite different. It is possible that Microsoft may have its special version, because I know that e.g. Google got "fixed version" for some 88W8xxx chips (it is/was available on internet). But firmware is loading by mwifiex driver and we (linux-firmware) have just one version of firmware for these cards. These 88W8xxx cards lost state and running firmware after reset/power so after linux is booted, it loads "linux-firmware" version into 88W8897 card and then card not run "possibly MS special edition FW". What can be possible is that we are dealing with ACPI firmware (which is same for both Windows and Linux OS) and then it is related to PCIe bridge where are some PCIe parts implemented... > As icing on the cake, the Marvell has been bought and I believe they > abandoned their products quite a while ago. You may read kernel > bugzilla for the details (last Marvell developer who answered to the > reports seems has no clue about the driver). Marvell 88W[89]xxx wifi cards were sold to NXP together with developers. Old @marvell addresses do not work so it is required to find new @nxp addresses for developers. There are recent firmware upgrades from NXP for linux-firmware, see: https://lore.kernel.org/linux-firmware/DB7PR04MB453855B0D6C41923BCB0922EFC1C9@DB7PR04MB4538.eurprd04.prod.outlook.com/ (just only SDIO firmware for 88W8897, not PCIe) But I know that response from NXP about these cards is slow... And more people are complaining about firmware/driver issues for these cards. > All that said, I believe that we have to follow whatever fixes we > would have and be thankful to contributors and testers. I agree. If this is really fixing these issues and we do not have better fix yet, go ahead with it (if PCIe/WiFi maintainers are fine with it). Once we have better fix, we can replace it. Currently I'm just trying to understand this issue more deeply (e.g. how it can relate with similar issue which I see on SDIO and if I cannot fix also SDIO in better way) but seems that nobody knows more than "this hack/quirk somehow works for PCIe". > For the record, I've been suffering from the Linux driver of this > hardware for a while. And I'm fully in support of any quirks that will > help UX. > > -- > With Best Regards, > Andy Shevchenko
On 7/9/21 6:12 PM, Pali Rohár wrote: [...] >>> Hello! Now I'm thinking loudly about this patch. Why this kind of reset >>> is needed only for Surface devices? AFAIK these 88W8897 chips are same >>> in all cards. Chip itself implements PCIe interface (and also SDIO) so >>> for me looks very strange if this 88W8897 PCIe device needs DMI specific >>> quirks. I cannot believe that Microsoft got some special version of >>> these chips from Marvell which are different than version uses on cards >>> in mPCIe form factor. >>> >>> And now when I'm reading comment below about PCIe bridge to which is >>> this 88W8897 PCIe chip connected, is not this rather an issue in that >>> PCIe bridge (instead of mwifiex/88W8897) or in ACPI firmware which >>> controls this bridge? >>> >>> Or are having other people same issues on mPCIe form factor wifi cards >>> with 88W8897 chips and then this quirk should not DMI dependent? >>> >>> Note that I'm seeing issues with reset and other things also on chip >>> 88W8997 when is connected to system via SDIO. These chips have both PCIe >>> and SDIO buses, it just depends which pins are used. >>> >> >> Hi and thanks for the quick reply! Honestly I've no idea, this is just the >> first method we found that allows for a proper reset of the chip. What I >> know is that some Surface devices need that ACPI DSM call (the one that was >> done in the commit I dropped in this version of the patchset) to reset the >> chip instead of this method. >> >> Afaik other devices with this chip don't need this resetting method, at >> least Marvell employees couldn't reproduce the issues on their testing >> devices. >> >> So would you suggest we just try to match for the pci chip 88W8897 instead? > > Hello! Such suggestion makes sense when we know that it is 88W8897 > issue. But if you got information that issue cannot be reproduced on > other 88W8897 cards then matching 88W8897 is not correct. > > From all this information looks like that it is problem in (Microsoft?) > PCIe bridge to which is card connected. Otherwise I do not reason how it > can be 88W8897 affected. Either it is reproducible on 88W8897 cards also > in other devices or issue is not on 88W8897 card. I doubt that it's an issue with the PCIe bridge (itself at least). The same type of bridge is used for both dGPU and NVME SSD on my device (see lspci output below) and those work fine. Also if I'm seeing that right it's from the Intel CPU, so my guess is that a lot more people would have issues with that then. I don't know about the hardware side, so it might be possible that it's an issue with integrating both bridge and wifi chip, in which case it's still probably best handled via DMI quirks unless we know more. Also as Tsuchiya mentioned in his original submission, on Windows the device is reset via this D3cold method. I've only skimmed that errata.inf file mentioned, but I think this is what he's referring to: Controls whether ACPIDeviceEnableD3ColdOnSurpriseRemoval rule will be evaluated or not on a given platform. Currently ACPIDeviceEnableD3ColdOnSurpriseRemoval rule only needs to be evaluated on Surface platforms which contain the Marvell WiFi controller which depends on device going through D3Cold as part of surprise-removal. and Starting with Windows releases *after* Blue, ACPI will not put surprise-removed devices into D3Cold automatically. Some known scenarios (viz. WiFi reset/recovery) rely on the device cycling through D3Cold on surprise-removal. This hack allows surprise-removed devices to be put into D3Cold (if supported by the stack). So, as far as I can tell, the chip doesn't like to be surprise-removed (which seems to happen during reset) and then needs to be power-cycled, which I think is likely due to some issue with firmware state. So the quirk on Windows seems very Surface specific. There also seem a bunch of revisions of these chips around, for example my SB2 is affected by a bug that we've tied to the specific hardware revision which causes some issues with host-sleep (IIRC chip switches rapidly between wake and sleep states without any external influence, which is not how it should behave and how it does behave on a later hardware revision). >> Then we'd probably have to check if there are any laptops where multiple >> devices are connected to the pci bridge as Amey suggested in a review >> before. > > Well, I do not know... But if this is issue with PCIe bridge then > similar issue could be observed also for other PCIe devices with this > PCIe bridge. But question is if there are other laptops with this PCIe > bridge. And also it can be a problem in ACPI firmware on those Surface > devices, which implements some PCIe bridge functionality. So it is > possible that issue is with PCIe bridge, not in HW, but in SW/firmware > part which can be Microsoft specific... So too many questions to which > we do not know answers. > > Could you provide output of 'lspci -nn -vv' and 'lspci -tvnn' on > affected machines? If you have already sent it in some previous email, > just send a link. At least I'm not able to find it right now and output > may contain something useful... From my Surface Book 2 (with the same issue): - lspci -tvnn: https://paste.ubuntu.com/p/mm3YpcZJ8N/ - lspci -vv -nn: https://paste.ubuntu.com/p/dctTDP738N/ Regards, Max
On Friday 09 July 2021 19:03:37 Maximilian Luz wrote: > On 7/9/21 6:12 PM, Pali Rohár wrote: > > [...] > > > > > Hello! Now I'm thinking loudly about this patch. Why this kind of reset > > > > is needed only for Surface devices? AFAIK these 88W8897 chips are same > > > > in all cards. Chip itself implements PCIe interface (and also SDIO) so > > > > for me looks very strange if this 88W8897 PCIe device needs DMI specific > > > > quirks. I cannot believe that Microsoft got some special version of > > > > these chips from Marvell which are different than version uses on cards > > > > in mPCIe form factor. > > > > > > > > And now when I'm reading comment below about PCIe bridge to which is > > > > this 88W8897 PCIe chip connected, is not this rather an issue in that > > > > PCIe bridge (instead of mwifiex/88W8897) or in ACPI firmware which > > > > controls this bridge? > > > > > > > > Or are having other people same issues on mPCIe form factor wifi cards > > > > with 88W8897 chips and then this quirk should not DMI dependent? > > > > > > > > Note that I'm seeing issues with reset and other things also on chip > > > > 88W8997 when is connected to system via SDIO. These chips have both PCIe > > > > and SDIO buses, it just depends which pins are used. > > > > > > > > > > Hi and thanks for the quick reply! Honestly I've no idea, this is just the > > > first method we found that allows for a proper reset of the chip. What I > > > know is that some Surface devices need that ACPI DSM call (the one that was > > > done in the commit I dropped in this version of the patchset) to reset the > > > chip instead of this method. > > > > > > Afaik other devices with this chip don't need this resetting method, at > > > least Marvell employees couldn't reproduce the issues on their testing > > > devices. > > > > > > So would you suggest we just try to match for the pci chip 88W8897 instead? > > > > Hello! Such suggestion makes sense when we know that it is 88W8897 > > issue. But if you got information that issue cannot be reproduced on > > other 88W8897 cards then matching 88W8897 is not correct. > > > > From all this information looks like that it is problem in (Microsoft?) > > PCIe bridge to which is card connected. Otherwise I do not reason how it > > can be 88W8897 affected. Either it is reproducible on 88W8897 cards also > > in other devices or issue is not on 88W8897 card. > > I doubt that it's an issue with the PCIe bridge (itself at least). The > same type of bridge is used for both dGPU and NVME SSD on my device (see > lspci output below) and those work fine. Also if I'm seeing that right > it's from the Intel CPU, so my guess is that a lot more people would > have issues with that then. From information below it seems to be related to surprise removal. Therefore is surprise removal working without issue for dGPU or NVME SSD? Not all PCIe bridges support surprise removal... > > I don't know about the hardware side, so it might be possible that it's > an issue with integrating both bridge and wifi chip, in which case it's > still probably best handled via DMI quirks unless we know more. > > Also as Tsuchiya mentioned in his original submission, on Windows the > device is reset via this D3cold method. I've only skimmed that > errata.inf file mentioned, but I think this is what he's referring to: > > Controls whether ACPIDeviceEnableD3ColdOnSurpriseRemoval rule will be > evaluated or not on a given platform. Currently > ACPIDeviceEnableD3ColdOnSurpriseRemoval rule only needs to be > evaluated on Surface platforms which contain the Marvell WiFi > controller which depends on device going through D3Cold as part of > surprise-removal. > > and > > Starting with Windows releases *after* Blue, ACPI will not put > surprise-removed devices into D3Cold automatically. Some known > scenarios (viz. WiFi reset/recovery) rely on the device cycling > through D3Cold on surprise-removal. This hack allows surprise-removed > devices to be put into D3Cold (if supported by the stack). > > So, as far as I can tell, the chip doesn't like to be surprise-removed > (which seems to happen during reset) and then needs to be power-cycled, > which I think is likely due to some issue with firmware state. Thanks for information. This really does not look like PCIe bridge specific if bridge itself can handle surprise-removed devices. lspci can tell us if bridge supports it or not (see below). > So the quirk on Windows seems very Surface specific. > > There also seem a bunch of revisions of these chips around, for example > my SB2 is affected by a bug that we've tied to the specific hardware > revision which causes some issues with host-sleep (IIRC chip switches > rapidly between wake and sleep states without any external influence, > which is not how it should behave and how it does behave on a later > hardware revision). Interesting... This looks like the issue can be in 88W8897 chip and needs some special conditions to trigger? And Surface is triggering it always? > > > Then we'd probably have to check if there are any laptops where multiple > > > devices are connected to the pci bridge as Amey suggested in a review > > > before. > > > > Well, I do not know... But if this is issue with PCIe bridge then > > similar issue could be observed also for other PCIe devices with this > > PCIe bridge. But question is if there are other laptops with this PCIe > > bridge. And also it can be a problem in ACPI firmware on those Surface > > devices, which implements some PCIe bridge functionality. So it is > > possible that issue is with PCIe bridge, not in HW, but in SW/firmware > > part which can be Microsoft specific... So too many questions to which > > we do not know answers. > > > > Could you provide output of 'lspci -nn -vv' and 'lspci -tvnn' on > > affected machines? If you have already sent it in some previous email, > > just send a link. At least I'm not able to find it right now and output > > may contain something useful... > > From my Surface Book 2 (with the same issue): > > - lspci -tvnn: https://paste.ubuntu.com/p/mm3YpcZJ8N/ > - lspci -vv -nn: https://paste.ubuntu.com/p/dctTDP738N/ Could you re-run lspci under root account? There are missing important parts like "Capabilities: <access denied>" where is information if bridge supports surprise removal or not. > Regards, > Max
On 7/9/21 7:30 PM, Pali Rohár wrote: > On Friday 09 July 2021 19:03:37 Maximilian Luz wrote: >> On 7/9/21 6:12 PM, Pali Rohár wrote: >> >> [...] >> >>>>> Hello! Now I'm thinking loudly about this patch. Why this kind of reset >>>>> is needed only for Surface devices? AFAIK these 88W8897 chips are same >>>>> in all cards. Chip itself implements PCIe interface (and also SDIO) so >>>>> for me looks very strange if this 88W8897 PCIe device needs DMI specific >>>>> quirks. I cannot believe that Microsoft got some special version of >>>>> these chips from Marvell which are different than version uses on cards >>>>> in mPCIe form factor. >>>>> >>>>> And now when I'm reading comment below about PCIe bridge to which is >>>>> this 88W8897 PCIe chip connected, is not this rather an issue in that >>>>> PCIe bridge (instead of mwifiex/88W8897) or in ACPI firmware which >>>>> controls this bridge? >>>>> >>>>> Or are having other people same issues on mPCIe form factor wifi cards >>>>> with 88W8897 chips and then this quirk should not DMI dependent? >>>>> >>>>> Note that I'm seeing issues with reset and other things also on chip >>>>> 88W8997 when is connected to system via SDIO. These chips have both PCIe >>>>> and SDIO buses, it just depends which pins are used. >>>>> >>>> >>>> Hi and thanks for the quick reply! Honestly I've no idea, this is just the >>>> first method we found that allows for a proper reset of the chip. What I >>>> know is that some Surface devices need that ACPI DSM call (the one that was >>>> done in the commit I dropped in this version of the patchset) to reset the >>>> chip instead of this method. >>>> >>>> Afaik other devices with this chip don't need this resetting method, at >>>> least Marvell employees couldn't reproduce the issues on their testing >>>> devices. >>>> >>>> So would you suggest we just try to match for the pci chip 88W8897 instead? >>> >>> Hello! Such suggestion makes sense when we know that it is 88W8897 >>> issue. But if you got information that issue cannot be reproduced on >>> other 88W8897 cards then matching 88W8897 is not correct. >>> >>> From all this information looks like that it is problem in (Microsoft?) >>> PCIe bridge to which is card connected. Otherwise I do not reason how it >>> can be 88W8897 affected. Either it is reproducible on 88W8897 cards also >>> in other devices or issue is not on 88W8897 card. >> >> I doubt that it's an issue with the PCIe bridge (itself at least). The >> same type of bridge is used for both dGPU and NVME SSD on my device (see >> lspci output below) and those work fine. Also if I'm seeing that right >> it's from the Intel CPU, so my guess is that a lot more people would >> have issues with that then. > > From information below it seems to be related to surprise removal. > Therefore is surprise removal working without issue for dGPU or NVME > SSD? Not all PCIe bridges support surprise removal... The dGPU on the Surface Book 2 is detachable (the whole base where that is placed can be removed). As far as I can tell surprise removal works perfectly fine for that one. The only thing that it needs is a driver for out-of-band hot-plug signalling if the device is in D3cold while removed as hotplug/removal notifications via PCI don't work in D3cold (this works via ACPI, there is as far as I can tell no such mechanism for WiFi, probably since it's not intended to be hot-unplugged). >> I don't know about the hardware side, so it might be possible that it's >> an issue with integrating both bridge and wifi chip, in which case it's >> still probably best handled via DMI quirks unless we know more. >> >> Also as Tsuchiya mentioned in his original submission, on Windows the >> device is reset via this D3cold method. I've only skimmed that >> errata.inf file mentioned, but I think this is what he's referring to: >> >> Controls whether ACPIDeviceEnableD3ColdOnSurpriseRemoval rule will be >> evaluated or not on a given platform. Currently >> ACPIDeviceEnableD3ColdOnSurpriseRemoval rule only needs to be >> evaluated on Surface platforms which contain the Marvell WiFi >> controller which depends on device going through D3Cold as part of >> surprise-removal. >> >> and >> >> Starting with Windows releases *after* Blue, ACPI will not put >> surprise-removed devices into D3Cold automatically. Some known >> scenarios (viz. WiFi reset/recovery) rely on the device cycling >> through D3Cold on surprise-removal. This hack allows surprise-removed >> devices to be put into D3Cold (if supported by the stack). >> >> So, as far as I can tell, the chip doesn't like to be surprise-removed >> (which seems to happen during reset) and then needs to be power-cycled, >> which I think is likely due to some issue with firmware state. > > Thanks for information. This really does not look like PCIe bridge > specific if bridge itself can handle surprise-removed devices. lspci can > tell us if bridge supports it or not (see below). > >> So the quirk on Windows seems very Surface specific. >> >> There also seem a bunch of revisions of these chips around, for example >> my SB2 is affected by a bug that we've tied to the specific hardware >> revision which causes some issues with host-sleep (IIRC chip switches >> rapidly between wake and sleep states without any external influence, >> which is not how it should behave and how it does behave on a later >> hardware revision). > > Interesting... This looks like the issue can be in 88W8897 chip and > needs some special conditions to trigger? And Surface is triggering it > always? Not always. It's been a while since I've been actively looking at this and I'm not sure we ever had a good way to reproduce this. Also, I've never really dealt with it as in-depth as Tsuchiya and Jonas have. My (very) quick attempt ('echo 1 > /sys/bus/pci/.../reset) at reproducing this didn't work, so I think at very least a network connection needs to be active. Unfortunately I can't test that with a network connection (and without compiling a custom kernel for which I don't have the time right now) because there's currently another bug deadlocking on device removal if there's an active connection during removal (which also seems to trigger on reset). That one ill be fixed by https://lore.kernel.org/linux-wireless/20210515024227.2159311-1-briannorris@chromium.org/ Jonas might know more. >>>> Then we'd probably have to check if there are any laptops where multiple >>>> devices are connected to the pci bridge as Amey suggested in a review >>>> before. >>> >>> Well, I do not know... But if this is issue with PCIe bridge then >>> similar issue could be observed also for other PCIe devices with this >>> PCIe bridge. But question is if there are other laptops with this PCIe >>> bridge. And also it can be a problem in ACPI firmware on those Surface >>> devices, which implements some PCIe bridge functionality. So it is >>> possible that issue is with PCIe bridge, not in HW, but in SW/firmware >>> part which can be Microsoft specific... So too many questions to which >>> we do not know answers. >>> >>> Could you provide output of 'lspci -nn -vv' and 'lspci -tvnn' on >>> affected machines? If you have already sent it in some previous email, >>> just send a link. At least I'm not able to find it right now and output >>> may contain something useful... >> >> From my Surface Book 2 (with the same issue): >> >> - lspci -tvnn: https://paste.ubuntu.com/p/mm3YpcZJ8N/ >> - lspci -vv -nn: https://paste.ubuntu.com/p/dctTDP738N/ > > Could you re-run lspci under root account? There are missing important > parts like "Capabilities: <access denied>" where is information if > bridge supports surprise removal or not. Ah sorry, sure thing. Here's the updated lspci -nn -vv log: https://paste.ubuntu.com/p/fzsmCvm86Y/ The log for lspci -tvnn is the same. >> Regards, >> Max
On Friday 09 July 2021 20:16:49 Maximilian Luz wrote: > On 7/9/21 7:30 PM, Pali Rohár wrote: > > On Friday 09 July 2021 19:03:37 Maximilian Luz wrote: > > > On 7/9/21 6:12 PM, Pali Rohár wrote: > > > > > > [...] > > > > > > > > > Hello! Now I'm thinking loudly about this patch. Why this kind of reset > > > > > > is needed only for Surface devices? AFAIK these 88W8897 chips are same > > > > > > in all cards. Chip itself implements PCIe interface (and also SDIO) so > > > > > > for me looks very strange if this 88W8897 PCIe device needs DMI specific > > > > > > quirks. I cannot believe that Microsoft got some special version of > > > > > > these chips from Marvell which are different than version uses on cards > > > > > > in mPCIe form factor. > > > > > > > > > > > > And now when I'm reading comment below about PCIe bridge to which is > > > > > > this 88W8897 PCIe chip connected, is not this rather an issue in that > > > > > > PCIe bridge (instead of mwifiex/88W8897) or in ACPI firmware which > > > > > > controls this bridge? > > > > > > > > > > > > Or are having other people same issues on mPCIe form factor wifi cards > > > > > > with 88W8897 chips and then this quirk should not DMI dependent? > > > > > > > > > > > > Note that I'm seeing issues with reset and other things also on chip > > > > > > 88W8997 when is connected to system via SDIO. These chips have both PCIe > > > > > > and SDIO buses, it just depends which pins are used. > > > > > > > > > > > > > > > > Hi and thanks for the quick reply! Honestly I've no idea, this is just the > > > > > first method we found that allows for a proper reset of the chip. What I > > > > > know is that some Surface devices need that ACPI DSM call (the one that was > > > > > done in the commit I dropped in this version of the patchset) to reset the > > > > > chip instead of this method. > > > > > > > > > > Afaik other devices with this chip don't need this resetting method, at > > > > > least Marvell employees couldn't reproduce the issues on their testing > > > > > devices. > > > > > > > > > > So would you suggest we just try to match for the pci chip 88W8897 instead? > > > > > > > > Hello! Such suggestion makes sense when we know that it is 88W8897 > > > > issue. But if you got information that issue cannot be reproduced on > > > > other 88W8897 cards then matching 88W8897 is not correct. > > > > > > > > From all this information looks like that it is problem in (Microsoft?) > > > > PCIe bridge to which is card connected. Otherwise I do not reason how it > > > > can be 88W8897 affected. Either it is reproducible on 88W8897 cards also > > > > in other devices or issue is not on 88W8897 card. > > > > > > I doubt that it's an issue with the PCIe bridge (itself at least). The > > > same type of bridge is used for both dGPU and NVME SSD on my device (see > > > lspci output below) and those work fine. Also if I'm seeing that right > > > it's from the Intel CPU, so my guess is that a lot more people would > > > have issues with that then. > > > > From information below it seems to be related to surprise removal. > > Therefore is surprise removal working without issue for dGPU or NVME > > SSD? Not all PCIe bridges support surprise removal... > > The dGPU on the Surface Book 2 is detachable (the whole base where that > is placed can be removed). As far as I can tell surprise removal works > perfectly fine for that one. The only thing that it needs is a driver for > out-of-band hot-plug signalling if the device is in D3cold while removed > as hotplug/removal notifications via PCI don't work in D3cold (this > works via ACPI, there is as far as I can tell no such mechanism for > WiFi, probably since it's not intended to be hot-unplugged). Ok. Thank you for confirmation. > > > I don't know about the hardware side, so it might be possible that it's > > > an issue with integrating both bridge and wifi chip, in which case it's > > > still probably best handled via DMI quirks unless we know more. > > > > > > Also as Tsuchiya mentioned in his original submission, on Windows the > > > device is reset via this D3cold method. I've only skimmed that > > > errata.inf file mentioned, but I think this is what he's referring to: > > > > > > Controls whether ACPIDeviceEnableD3ColdOnSurpriseRemoval rule will be > > > evaluated or not on a given platform. Currently > > > ACPIDeviceEnableD3ColdOnSurpriseRemoval rule only needs to be > > > evaluated on Surface platforms which contain the Marvell WiFi > > > controller which depends on device going through D3Cold as part of > > > surprise-removal. > > > > > > and > > > > > > Starting with Windows releases *after* Blue, ACPI will not put > > > surprise-removed devices into D3Cold automatically. Some known > > > scenarios (viz. WiFi reset/recovery) rely on the device cycling > > > through D3Cold on surprise-removal. This hack allows surprise-removed > > > devices to be put into D3Cold (if supported by the stack). > > > > > > So, as far as I can tell, the chip doesn't like to be surprise-removed > > > (which seems to happen during reset) and then needs to be power-cycled, > > > which I think is likely due to some issue with firmware state. > > > > Thanks for information. This really does not look like PCIe bridge > > specific if bridge itself can handle surprise-removed devices. lspci can > > tell us if bridge supports it or not (see below). > > > > > So the quirk on Windows seems very Surface specific. > > > > > > There also seem a bunch of revisions of these chips around, for example > > > my SB2 is affected by a bug that we've tied to the specific hardware > > > revision which causes some issues with host-sleep (IIRC chip switches > > > rapidly between wake and sleep states without any external influence, > > > which is not how it should behave and how it does behave on a later > > > hardware revision). > > > > Interesting... This looks like the issue can be in 88W8897 chip and > > needs some special conditions to trigger? And Surface is triggering it > > always? > > Not always. It's been a while since I've been actively looking at this > and I'm not sure we ever had a good way to reproduce this. Also, I've > never really dealt with it as in-depth as Tsuchiya and Jonas have. > > My (very) quick attempt ('echo 1 > /sys/bus/pci/.../reset) at > reproducing this didn't work, so I think at very least a network > connection needs to be active. This is doing PCIe function level reset. Maybe you can get more luck with PCIe Hot Reset. See following link how to trigger PCIe Hot Reset from userspace: https://alexforencich.com/wiki/en/pcie/hot-reset-linux > Unfortunately I can't test that with a > network connection (and without compiling a custom kernel for which I > don't have the time right now) because there's currently another bug > deadlocking on device removal if there's an active connection during > removal (which also seems to trigger on reset). That one ill be fixed > by > > https://lore.kernel.org/linux-wireless/20210515024227.2159311-1-briannorris@chromium.org/ > > Jonas might know more. > > > > > > Then we'd probably have to check if there are any laptops where multiple > > > > > devices are connected to the pci bridge as Amey suggested in a review > > > > > before. > > > > > > > > Well, I do not know... But if this is issue with PCIe bridge then > > > > similar issue could be observed also for other PCIe devices with this > > > > PCIe bridge. But question is if there are other laptops with this PCIe > > > > bridge. And also it can be a problem in ACPI firmware on those Surface > > > > devices, which implements some PCIe bridge functionality. So it is > > > > possible that issue is with PCIe bridge, not in HW, but in SW/firmware > > > > part which can be Microsoft specific... So too many questions to which > > > > we do not know answers. > > > > > > > > Could you provide output of 'lspci -nn -vv' and 'lspci -tvnn' on > > > > affected machines? If you have already sent it in some previous email, > > > > just send a link. At least I'm not able to find it right now and output > > > > may contain something useful... > > > > > > From my Surface Book 2 (with the same issue): > > > > > > - lspci -tvnn: https://paste.ubuntu.com/p/mm3YpcZJ8N/ > > > - lspci -vv -nn: https://paste.ubuntu.com/p/dctTDP738N/ > > > > Could you re-run lspci under root account? There are missing important > > parts like "Capabilities: <access denied>" where is information if > > bridge supports surprise removal or not. > > Ah sorry, sure thing. Here's the updated lspci -nn -vv log: > > https://paste.ubuntu.com/p/fzsmCvm86Y/ > > The log for lspci -tvnn is the same. Ok. So bridge for wifi card (00:1c.0) indicates: SltCap: AttnBtn- PwrCtrl- MRL- AttnInd- PwrInd- HotPlug- Surprise- Slot #0, PowerLimit 10.000W; Interlock- NoCompl+ No support for surprise removal, nor for hotplug interrupt. But bridge for nvidia card (00:1c.4) indicates: SltCap: AttnBtn- PwrCtrl- MRL- AttnInd- PwrInd- HotPlug+ Surprise+ Slot #4, PowerLimit 25.000W; Interlock- NoCompl+ And interesting, it supports hotplug interrupt and also surprise removal. Which matches what you wrote above about dGPU. So another idea: maybe problem is really in 88W8897 and recovering is working only via bridge which supports surprise removal? Just guessing. Or kernel PCIe hotplug driver is doing something which is needed for recovering 88W8897 and because this bridge does not support surprise removal, it behaves differently? > > > > Regards, > > > Max
On 7/9/21 8:44 PM, Pali Rohár wrote: [...] >> My (very) quick attempt ('echo 1 > /sys/bus/pci/.../reset) at >> reproducing this didn't work, so I think at very least a network >> connection needs to be active. > > This is doing PCIe function level reset. Maybe you can get more luck > with PCIe Hot Reset. See following link how to trigger PCIe Hot Reset > from userspace: https://alexforencich.com/wiki/en/pcie/hot-reset-linux Thanks for that link! That does indeed do something which breaks the adapter. Running the script produces [ 178.388414] mwifiex_pcie 0000:01:00.0: PREP_CMD: card is removed [ 178.389128] mwifiex_pcie 0000:01:00.0: PREP_CMD: card is removed [ 178.461365] mwifiex_pcie 0000:01:00.0: performing cancel_work_sync()... [ 178.461373] mwifiex_pcie 0000:01:00.0: cancel_work_sync() done [ 178.984106] pci 0000:01:00.0: [11ab:2b38] type 00 class 0x020000 [ 178.984161] pci 0000:01:00.0: reg 0x10: [mem 0x00000000-0x000fffff 64bit pref] [ 178.984193] pci 0000:01:00.0: reg 0x18: [mem 0x00000000-0x000fffff 64bit pref] [ 178.984430] pci 0000:01:00.0: supports D1 D2 [ 178.984434] pci 0000:01:00.0: PME# supported from D0 D1 D3hot D3cold [ 178.984871] pcieport 0000:00:1c.0: ASPM: current common clock configuration is inconsistent, reconfiguring [ 179.297919] pci 0000:01:00.0: BAR 0: assigned [mem 0xd4400000-0xd44fffff 64bit pref] [ 179.297961] pci 0000:01:00.0: BAR 2: assigned [mem 0xd4500000-0xd45fffff 64bit pref] [ 179.298316] mwifiex_pcie 0000:01:00.0: enabling device (0000 -> 0002) [ 179.298752] mwifiex_pcie: PCI memory map Virt0: 00000000c4593df1 PCI memory map Virt2: 0000000039d67daf [ 179.300522] mwifiex_pcie 0000:01:00.0: WLAN read winner status failed! [ 179.300552] mwifiex_pcie 0000:01:00.0: info: _mwifiex_fw_dpc: unregister device [ 179.300622] mwifiex_pcie 0000:01:00.0: Read register failed [ 179.300912] mwifiex_pcie 0000:01:00.0: performing cancel_work_sync()... [ 179.300928] mwifiex_pcie 0000:01:00.0: cancel_work_sync() done after which the card is unusable (there is no WiFi interface availabel any more). Reloading the driver module doesn't help and produces [ 376.906833] mwifiex_pcie: PCI memory map Virt0: 0000000025149d28 PCI memory map Virt2: 00000000c4593df1 [ 376.907278] mwifiex_pcie 0000:01:00.0: WLAN read winner status failed! [ 376.907281] mwifiex_pcie 0000:01:00.0: info: _mwifiex_fw_dpc: unregister device [ 376.907293] mwifiex_pcie 0000:01:00.0: Read register failed [ 376.907404] mwifiex_pcie 0000:01:00.0: performing cancel_work_sync()... [ 376.907406] mwifiex_pcie 0000:01:00.0: cancel_work_sync() done again. Performing a function level reset produces [ 402.489572] mwifiex_pcie 0000:01:00.0: mwifiex_pcie_reset_prepare: adapter structure is not valid [ 403.514219] mwifiex_pcie 0000:01:00.0: mwifiex_pcie_reset_done: adapter structure is not valid and doesn't help either. Running the same command on a kernel with (among other) this patch unfortunately also breaks the adapter in the same way. As far as I can tell though, it doesn't run through the reset code added by this patch (as indicated by the log message when performing FLR), which I assume in a non-forced scenario, e.g. firmware issues (which IIRC is why this patch exists), it would? >> Unfortunately I can't test that with a >> network connection (and without compiling a custom kernel for which I >> don't have the time right now) because there's currently another bug >> deadlocking on device removal if there's an active connection during >> removal (which also seems to trigger on reset). That one ill be fixed >> by >> >> https://lore.kernel.org/linux-wireless/20210515024227.2159311-1-briannorris@chromium.org/ >> >> Jonas might know more. [...]
On Friday 09 July 2021 21:27:51 Maximilian Luz wrote: > On 7/9/21 8:44 PM, Pali Rohár wrote: > > [...] > > > > My (very) quick attempt ('echo 1 > /sys/bus/pci/.../reset) at > > > reproducing this didn't work, so I think at very least a network > > > connection needs to be active. > > > > This is doing PCIe function level reset. Maybe you can get more luck > > with PCIe Hot Reset. See following link how to trigger PCIe Hot Reset > > from userspace: https://alexforencich.com/wiki/en/pcie/hot-reset-linux > > Thanks for that link! That does indeed do something which breaks the > adapter. Running the script produces > > [ 178.388414] mwifiex_pcie 0000:01:00.0: PREP_CMD: card is removed > [ 178.389128] mwifiex_pcie 0000:01:00.0: PREP_CMD: card is removed > [ 178.461365] mwifiex_pcie 0000:01:00.0: performing cancel_work_sync()... > [ 178.461373] mwifiex_pcie 0000:01:00.0: cancel_work_sync() done > [ 178.984106] pci 0000:01:00.0: [11ab:2b38] type 00 class 0x020000 > [ 178.984161] pci 0000:01:00.0: reg 0x10: [mem 0x00000000-0x000fffff 64bit pref] > [ 178.984193] pci 0000:01:00.0: reg 0x18: [mem 0x00000000-0x000fffff 64bit pref] > [ 178.984430] pci 0000:01:00.0: supports D1 D2 > [ 178.984434] pci 0000:01:00.0: PME# supported from D0 D1 D3hot D3cold > [ 178.984871] pcieport 0000:00:1c.0: ASPM: current common clock configuration is inconsistent, reconfiguring > [ 179.297919] pci 0000:01:00.0: BAR 0: assigned [mem 0xd4400000-0xd44fffff 64bit pref] > [ 179.297961] pci 0000:01:00.0: BAR 2: assigned [mem 0xd4500000-0xd45fffff 64bit pref] > [ 179.298316] mwifiex_pcie 0000:01:00.0: enabling device (0000 -> 0002) > [ 179.298752] mwifiex_pcie: PCI memory map Virt0: 00000000c4593df1 PCI memory map Virt2: 0000000039d67daf > [ 179.300522] mwifiex_pcie 0000:01:00.0: WLAN read winner status failed! > [ 179.300552] mwifiex_pcie 0000:01:00.0: info: _mwifiex_fw_dpc: unregister device > [ 179.300622] mwifiex_pcie 0000:01:00.0: Read register failed > [ 179.300912] mwifiex_pcie 0000:01:00.0: performing cancel_work_sync()... > [ 179.300928] mwifiex_pcie 0000:01:00.0: cancel_work_sync() done > > after which the card is unusable (there is no WiFi interface availabel > any more). Reloading the driver module doesn't help and produces > > [ 376.906833] mwifiex_pcie: PCI memory map Virt0: 0000000025149d28 PCI memory map Virt2: 00000000c4593df1 > [ 376.907278] mwifiex_pcie 0000:01:00.0: WLAN read winner status failed! > [ 376.907281] mwifiex_pcie 0000:01:00.0: info: _mwifiex_fw_dpc: unregister device > [ 376.907293] mwifiex_pcie 0000:01:00.0: Read register failed > [ 376.907404] mwifiex_pcie 0000:01:00.0: performing cancel_work_sync()... > [ 376.907406] mwifiex_pcie 0000:01:00.0: cancel_work_sync() done > > again. Performing a function level reset produces > > [ 402.489572] mwifiex_pcie 0000:01:00.0: mwifiex_pcie_reset_prepare: adapter structure is not valid > [ 403.514219] mwifiex_pcie 0000:01:00.0: mwifiex_pcie_reset_done: adapter structure is not valid > > and doesn't help either. More Qualcomm/Atheros wifi cards are broken in a way that they stop responding after PCIe Hot Reset and completely disappear from the PCIe bus. It is possible that similar issue have also these Marvell cards? As now we know that bride does not support hotplug it cannot inform system when card disconnect from the bus. The one possible way how to detect if PCIe card is available at specific address is trying to read its device and vendor id. Kernel caches device/vendor id, so from userspace is needed to call lspci with -b argument (to ignore kernel's reported cached value). Could you provide output of following command after you do PCIe Hot Reset? lspci -s 01:00.0 -b -vv (and compare with output which you have already provided if there are any differences) If PCIe Hot Reset is breaking the card then the only option how to "reset" card into working state again is PCIe Warm Reset. Unfortunately there is no common mechanism how to do it from system. PCIe Warm Reset is done by asserting PERST# signal on card itself, in mPCIe form factor it is pin 22. In most cases pin 22 is connected to some GPIO so via GPIO subsystem it could be controlled. > Running the same command on a kernel with (among other) this patch > unfortunately also breaks the adapter in the same way. As far as I can > tell though, it doesn't run through the reset code added by this patch > (as indicated by the log message when performing FLR), which I assume > in a non-forced scenario, e.g. firmware issues (which IIRC is why this > patch exists), it would? Err... I have caught this part. Is proposed patch able to recover also from state which happens after PCIe Hot Reset? > > > Unfortunately I can't test that with a > > > network connection (and without compiling a custom kernel for which I > > > don't have the time right now) because there's currently another bug > > > deadlocking on device removal if there's an active connection during > > > removal (which also seems to trigger on reset). That one ill be fixed > > > by > > > > > > https://lore.kernel.org/linux-wireless/20210515024227.2159311-1-briannorris@chromium.org/ > > > > > > Jonas might know more. > > [...]
On 7/9/21 9:44 PM, Pali Rohár wrote: > On Friday 09 July 2021 21:27:51 Maximilian Luz wrote: >> On 7/9/21 8:44 PM, Pali Rohár wrote: >> >> [...] >> >>>> My (very) quick attempt ('echo 1 > /sys/bus/pci/.../reset) at >>>> reproducing this didn't work, so I think at very least a network >>>> connection needs to be active. >>> >>> This is doing PCIe function level reset. Maybe you can get more luck >>> with PCIe Hot Reset. See following link how to trigger PCIe Hot Reset >>> from userspace: https://alexforencich.com/wiki/en/pcie/hot-reset-linux >> >> Thanks for that link! That does indeed do something which breaks the >> adapter. Running the script produces >> >> [ 178.388414] mwifiex_pcie 0000:01:00.0: PREP_CMD: card is removed >> [ 178.389128] mwifiex_pcie 0000:01:00.0: PREP_CMD: card is removed >> [ 178.461365] mwifiex_pcie 0000:01:00.0: performing cancel_work_sync()... >> [ 178.461373] mwifiex_pcie 0000:01:00.0: cancel_work_sync() done >> [ 178.984106] pci 0000:01:00.0: [11ab:2b38] type 00 class 0x020000 >> [ 178.984161] pci 0000:01:00.0: reg 0x10: [mem 0x00000000-0x000fffff 64bit pref] >> [ 178.984193] pci 0000:01:00.0: reg 0x18: [mem 0x00000000-0x000fffff 64bit pref] >> [ 178.984430] pci 0000:01:00.0: supports D1 D2 >> [ 178.984434] pci 0000:01:00.0: PME# supported from D0 D1 D3hot D3cold >> [ 178.984871] pcieport 0000:00:1c.0: ASPM: current common clock configuration is inconsistent, reconfiguring >> [ 179.297919] pci 0000:01:00.0: BAR 0: assigned [mem 0xd4400000-0xd44fffff 64bit pref] >> [ 179.297961] pci 0000:01:00.0: BAR 2: assigned [mem 0xd4500000-0xd45fffff 64bit pref] >> [ 179.298316] mwifiex_pcie 0000:01:00.0: enabling device (0000 -> 0002) >> [ 179.298752] mwifiex_pcie: PCI memory map Virt0: 00000000c4593df1 PCI memory map Virt2: 0000000039d67daf >> [ 179.300522] mwifiex_pcie 0000:01:00.0: WLAN read winner status failed! >> [ 179.300552] mwifiex_pcie 0000:01:00.0: info: _mwifiex_fw_dpc: unregister device >> [ 179.300622] mwifiex_pcie 0000:01:00.0: Read register failed >> [ 179.300912] mwifiex_pcie 0000:01:00.0: performing cancel_work_sync()... >> [ 179.300928] mwifiex_pcie 0000:01:00.0: cancel_work_sync() done >> >> after which the card is unusable (there is no WiFi interface availabel >> any more). Reloading the driver module doesn't help and produces >> >> [ 376.906833] mwifiex_pcie: PCI memory map Virt0: 0000000025149d28 PCI memory map Virt2: 00000000c4593df1 >> [ 376.907278] mwifiex_pcie 0000:01:00.0: WLAN read winner status failed! >> [ 376.907281] mwifiex_pcie 0000:01:00.0: info: _mwifiex_fw_dpc: unregister device >> [ 376.907293] mwifiex_pcie 0000:01:00.0: Read register failed >> [ 376.907404] mwifiex_pcie 0000:01:00.0: performing cancel_work_sync()... >> [ 376.907406] mwifiex_pcie 0000:01:00.0: cancel_work_sync() done >> >> again. Performing a function level reset produces >> >> [ 402.489572] mwifiex_pcie 0000:01:00.0: mwifiex_pcie_reset_prepare: adapter structure is not valid >> [ 403.514219] mwifiex_pcie 0000:01:00.0: mwifiex_pcie_reset_done: adapter structure is not valid >> >> and doesn't help either. > > More Qualcomm/Atheros wifi cards are broken in a way that they stop > responding after PCIe Hot Reset and completely disappear from the PCIe > bus. It is possible that similar issue have also these Marvell cards? > > As now we know that bride does not support hotplug it cannot inform > system when card disconnect from the bus. The one possible way how to > detect if PCIe card is available at specific address is trying to read > its device and vendor id. Kernel caches device/vendor id, so from > userspace is needed to call lspci with -b argument (to ignore kernel's > reported cached value). Could you provide output of following command > after you do PCIe Hot Reset? > > lspci -s 01:00.0 -b -vv > > (and compare with output which you have already provided if there are > any differences) There do seem to be some differences, specifically regarding memory. See https://paste.ubuntu.com/p/Rz2CDjwkCv/ for the full output. > If PCIe Hot Reset is breaking the card then the only option how to > "reset" card into working state again is PCIe Warm Reset. Unfortunately > there is no common mechanism how to do it from system. PCIe Warm Reset > is done by asserting PERST# signal on card itself, in mPCIe form factor > it is pin 22. In most cases pin 22 is connected to some GPIO so via GPIO > subsystem it could be controlled. > >> Running the same command on a kernel with (among other) this patch >> unfortunately also breaks the adapter in the same way. As far as I can >> tell though, it doesn't run through the reset code added by this patch >> (as indicated by the log message when performing FLR), which I assume >> in a non-forced scenario, e.g. firmware issues (which IIRC is why this >> patch exists), it would? > > Err... I have caught this part. Is proposed patch able to recover also > from state which happens after PCIe Hot Reset? I'm not sure at this point if the power-cycle through D3cold would fix this (I think it might, but have no evidence for that). This patch alone isn't able to recover the device, as, when triggering the hot-reset via that script, the code never seems to run mwifiex_pcie_reset_d3cold_quirk(). If I remember correctly, the main issue was that the firmware state doesn't get reset completely. This can be somewhat observed when doing 'echo 1 > /sys/bus/pci/devices/.../reset' via the difference in log messages: For an unpatched kernel: [ 64.159509] mwifiex_pcie 0000:01:00.0: info: shutdown mwifiex... [ 64.159546] mwifiex_pcie 0000:01:00.0: PREP_CMD: card is removed [ 64.159922] mwifiex_pcie 0000:01:00.0: PREP_CMD: card is removed [ 65.240272] mwifiex_pcie 0000:01:00.0: WLAN FW already running! Skip FW dnld [ 65.240285] mwifiex_pcie 0000:01:00.0: WLAN FW is active [ 65.327359] mwifiex_pcie 0000:01:00.0: info: MWIFIEX VERSION: mwifiex 1.0 (15.68.19.p21) [ 65.327370] mwifiex_pcie 0000:01:00.0: driver_version = mwifiex 1.0 (15.68.19.p21) For a patched kernel: [ 41.966094] mwifiex_pcie 0000:01:00.0: info: shutdown mwifiex... [ 41.966451] mwifiex_pcie 0000:01:00.0: PREP_CMD: card is removed [ 41.967227] mwifiex_pcie 0000:01:00.0: PREP_CMD: card is removed [ 42.063543] mwifiex_pcie 0000:01:00.0: Using reset_d3cold quirk to perform FW reset [ 42.063558] mwifiex_pcie 0000:01:00.0: putting into D3cold... [ 42.081010] usb 1-6: USB disconnect, device number 9 [ 42.339922] pcieport 0000:00:1c.0: putting into D3cold... [ 42.425766] pcieport 0000:00:1c.0: putting into D0... [ 42.695987] mwifiex_pcie 0000:01:00.0: putting into D0... [ 42.956673] mwifiex_pcie 0000:01:00.0: enabling device (0000 -> 0002) [ 45.012736] mwifiex_pcie 0000:01:00.0: info: FW download over, size 723540 bytes [ 45.740882] usb 1-6: new high-speed USB device number 10 using xhci_hcd [ 45.881294] usb 1-6: New USB device found, idVendor=1286, idProduct=204c, bcdDevice=32.01 [ 45.881308] usb 1-6: New USB device strings: Mfr=1, Product=2, SerialNumber=3 [ 45.881313] usb 1-6: Product: Bluetooth and Wireless LAN Composite Device [ 45.881318] usb 1-6: Manufacturer: Marvell [ 45.881322] usb 1-6: SerialNumber: 0000000000000000 [ 45.884882] Bluetooth: hci0: unexpected event for opcode 0x0000 [ 45.885128] Bluetooth: hci0: unexpected event for opcode 0x0000 [ 45.966218] mwifiex_pcie 0000:01:00.0: WLAN FW is active [ 46.157474] mwifiex_pcie 0000:01:00.0: info: MWIFIEX VERSION: mwifiex 1.0 (15.68.19.p21) [ 46.157485] mwifiex_pcie 0000:01:00.0: driver_version = mwifiex 1.0 (15.68.19.p21) Note the absence of "WLAN FW already running! Skip FW dnld" on the second log. Due to the power cycle we're essentially forcing the device to re-download and re-initialize its firmware. On an unpatched kernel, it looks like the firmware itself is kept and state may not be cleared properly. So in other words it looks like the firmware, while being prompted to do a reset, doesn't do that properly (at least when it has crashed before). IIRC this then allowed us to recover from firmware issues that the "normal" firmware reset that mwifiex supposedly performs on a FLR (or reloading driver modules) didn't help with. I'm not so sure any more if resetting actively caused issues or if it just showed different symptoms of some firmware issue that prompted us to do the reset in the first place (again, been quite a while since I last dealt with this stuff, sorry). All I know is that this patched reset gets the card going again. As a side note: There are also more patches by Jonas (and Tsuchiya?) building on top of the quirk implementation introduced here which significantly reduce the need for doing resets in the first place (nevertheless having a reset that actually does properly reset the device is a good thing IMHO). Those patches can for example be found here: https://github.com/linux-surface/kernel/compare/eaaf96ba58a5fe5999b89fe3afaded74caa96480...989c8725a6d4e62db6370dd0fefe45498274d3ce >>>> Unfortunately I can't test that with a >>>> network connection (and without compiling a custom kernel for which I >>>> don't have the time right now) because there's currently another bug >>>> deadlocking on device removal if there's an active connection during >>>> removal (which also seems to trigger on reset). That one ill be fixed >>>> by >>>> >>>> https://lore.kernel.org/linux-wireless/20210515024227.2159311-1-briannorris@chromium.org/ >>>> >>>> Jonas might know more. >> >> [...]
On Friday 09 July 2021 22:54:09 Maximilian Luz wrote: > On 7/9/21 9:44 PM, Pali Rohár wrote: > > On Friday 09 July 2021 21:27:51 Maximilian Luz wrote: > > > On 7/9/21 8:44 PM, Pali Rohár wrote: > > > > > > [...] > > > > > > > > My (very) quick attempt ('echo 1 > /sys/bus/pci/.../reset) at > > > > > reproducing this didn't work, so I think at very least a network > > > > > connection needs to be active. > > > > > > > > This is doing PCIe function level reset. Maybe you can get more luck > > > > with PCIe Hot Reset. See following link how to trigger PCIe Hot Reset > > > > from userspace: https://alexforencich.com/wiki/en/pcie/hot-reset-linux > > > > > > Thanks for that link! That does indeed do something which breaks the > > > adapter. Running the script produces > > > > > > [ 178.388414] mwifiex_pcie 0000:01:00.0: PREP_CMD: card is removed > > > [ 178.389128] mwifiex_pcie 0000:01:00.0: PREP_CMD: card is removed > > > [ 178.461365] mwifiex_pcie 0000:01:00.0: performing cancel_work_sync()... > > > [ 178.461373] mwifiex_pcie 0000:01:00.0: cancel_work_sync() done > > > [ 178.984106] pci 0000:01:00.0: [11ab:2b38] type 00 class 0x020000 > > > [ 178.984161] pci 0000:01:00.0: reg 0x10: [mem 0x00000000-0x000fffff 64bit pref] > > > [ 178.984193] pci 0000:01:00.0: reg 0x18: [mem 0x00000000-0x000fffff 64bit pref] > > > [ 178.984430] pci 0000:01:00.0: supports D1 D2 > > > [ 178.984434] pci 0000:01:00.0: PME# supported from D0 D1 D3hot D3cold > > > [ 178.984871] pcieport 0000:00:1c.0: ASPM: current common clock configuration is inconsistent, reconfiguring > > > [ 179.297919] pci 0000:01:00.0: BAR 0: assigned [mem 0xd4400000-0xd44fffff 64bit pref] > > > [ 179.297961] pci 0000:01:00.0: BAR 2: assigned [mem 0xd4500000-0xd45fffff 64bit pref] > > > [ 179.298316] mwifiex_pcie 0000:01:00.0: enabling device (0000 -> 0002) > > > [ 179.298752] mwifiex_pcie: PCI memory map Virt0: 00000000c4593df1 PCI memory map Virt2: 0000000039d67daf > > > [ 179.300522] mwifiex_pcie 0000:01:00.0: WLAN read winner status failed! > > > [ 179.300552] mwifiex_pcie 0000:01:00.0: info: _mwifiex_fw_dpc: unregister device > > > [ 179.300622] mwifiex_pcie 0000:01:00.0: Read register failed > > > [ 179.300912] mwifiex_pcie 0000:01:00.0: performing cancel_work_sync()... > > > [ 179.300928] mwifiex_pcie 0000:01:00.0: cancel_work_sync() done > > > > > > after which the card is unusable (there is no WiFi interface availabel > > > any more). Reloading the driver module doesn't help and produces > > > > > > [ 376.906833] mwifiex_pcie: PCI memory map Virt0: 0000000025149d28 PCI memory map Virt2: 00000000c4593df1 > > > [ 376.907278] mwifiex_pcie 0000:01:00.0: WLAN read winner status failed! > > > [ 376.907281] mwifiex_pcie 0000:01:00.0: info: _mwifiex_fw_dpc: unregister device > > > [ 376.907293] mwifiex_pcie 0000:01:00.0: Read register failed > > > [ 376.907404] mwifiex_pcie 0000:01:00.0: performing cancel_work_sync()... > > > [ 376.907406] mwifiex_pcie 0000:01:00.0: cancel_work_sync() done > > > > > > again. Performing a function level reset produces > > > > > > [ 402.489572] mwifiex_pcie 0000:01:00.0: mwifiex_pcie_reset_prepare: adapter structure is not valid > > > [ 403.514219] mwifiex_pcie 0000:01:00.0: mwifiex_pcie_reset_done: adapter structure is not valid > > > > > > and doesn't help either. > > > > More Qualcomm/Atheros wifi cards are broken in a way that they stop > > responding after PCIe Hot Reset and completely disappear from the PCIe > > bus. It is possible that similar issue have also these Marvell cards? > > > > As now we know that bride does not support hotplug it cannot inform > > system when card disconnect from the bus. The one possible way how to > > detect if PCIe card is available at specific address is trying to read > > its device and vendor id. Kernel caches device/vendor id, so from > > userspace is needed to call lspci with -b argument (to ignore kernel's > > reported cached value). Could you provide output of following command > > after you do PCIe Hot Reset? > > > > lspci -s 01:00.0 -b -vv > > > > (and compare with output which you have already provided if there are > > any differences) > > There do seem to be some differences, specifically regarding memory. > See https://paste.ubuntu.com/p/Rz2CDjwkCv/ for the full output. Output is OK, card is alive and available on bus. So at least we are not dealing with the similar issue like are observed in Qualcomm/Atheros. Missing memory just means that driver has not enabled memory access (because it did not successfully initialized). Anyway, card is reporting error via AER: Capabilities: [100 v2] Advanced Error Reporting UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq+ ACSViol- UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol- UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol- CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+ CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+ AERCap: First Error Pointer: 14, ECRCGenCap+ ECRCGenEn- ECRCChkCap+ ECRCChkEn- MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap- HeaderLog: 40000001 0000000f d4500c58 00000000 There is (UESta) marked that Unsupported Request happened and in HeaderLog is information about error. Unfortunately I'm not skilled to decode it, so maybe other PCI developers could say if there is something important in HeaderLog or not. > > If PCIe Hot Reset is breaking the card then the only option how to > > "reset" card into working state again is PCIe Warm Reset. Unfortunately > > there is no common mechanism how to do it from system. PCIe Warm Reset > > is done by asserting PERST# signal on card itself, in mPCIe form factor > > it is pin 22. In most cases pin 22 is connected to some GPIO so via GPIO > > subsystem it could be controlled. > > > > > Running the same command on a kernel with (among other) this patch > > > unfortunately also breaks the adapter in the same way. As far as I can > > > tell though, it doesn't run through the reset code added by this patch > > > (as indicated by the log message when performing FLR), which I assume > > > in a non-forced scenario, e.g. firmware issues (which IIRC is why this > > > patch exists), it would? > > > > Err... I have caught this part. Is proposed patch able to recover also > > from state which happens after PCIe Hot Reset? > > I'm not sure at this point if the power-cycle through D3cold would fix > this (I think it might, but have no evidence for that). This patch alone > isn't able to recover the device, as, when triggering the hot-reset via > that script, the code never seems to run mwifiex_pcie_reset_d3cold_quirk(). > > If I remember correctly, the main issue was that the firmware state > doesn't get reset completely. This can be somewhat observed when doing > 'echo 1 > /sys/bus/pci/devices/.../reset' via the difference in log > messages: > > For an unpatched kernel: > > [ 64.159509] mwifiex_pcie 0000:01:00.0: info: shutdown mwifiex... > [ 64.159546] mwifiex_pcie 0000:01:00.0: PREP_CMD: card is removed > [ 64.159922] mwifiex_pcie 0000:01:00.0: PREP_CMD: card is removed > [ 65.240272] mwifiex_pcie 0000:01:00.0: WLAN FW already running! Skip FW dnld > [ 65.240285] mwifiex_pcie 0000:01:00.0: WLAN FW is active > [ 65.327359] mwifiex_pcie 0000:01:00.0: info: MWIFIEX VERSION: mwifiex 1.0 (15.68.19.p21) > [ 65.327370] mwifiex_pcie 0000:01:00.0: driver_version = mwifiex 1.0 (15.68.19.p21) > > For a patched kernel: > > [ 41.966094] mwifiex_pcie 0000:01:00.0: info: shutdown mwifiex... > [ 41.966451] mwifiex_pcie 0000:01:00.0: PREP_CMD: card is removed > [ 41.967227] mwifiex_pcie 0000:01:00.0: PREP_CMD: card is removed > [ 42.063543] mwifiex_pcie 0000:01:00.0: Using reset_d3cold quirk to perform FW reset > [ 42.063558] mwifiex_pcie 0000:01:00.0: putting into D3cold... > [ 42.081010] usb 1-6: USB disconnect, device number 9 > [ 42.339922] pcieport 0000:00:1c.0: putting into D3cold... > [ 42.425766] pcieport 0000:00:1c.0: putting into D0... > [ 42.695987] mwifiex_pcie 0000:01:00.0: putting into D0... > [ 42.956673] mwifiex_pcie 0000:01:00.0: enabling device (0000 -> 0002) > [ 45.012736] mwifiex_pcie 0000:01:00.0: info: FW download over, size 723540 bytes > [ 45.740882] usb 1-6: new high-speed USB device number 10 using xhci_hcd > [ 45.881294] usb 1-6: New USB device found, idVendor=1286, idProduct=204c, bcdDevice=32.01 > [ 45.881308] usb 1-6: New USB device strings: Mfr=1, Product=2, SerialNumber=3 > [ 45.881313] usb 1-6: Product: Bluetooth and Wireless LAN Composite Device > [ 45.881318] usb 1-6: Manufacturer: Marvell > [ 45.881322] usb 1-6: SerialNumber: 0000000000000000 > [ 45.884882] Bluetooth: hci0: unexpected event for opcode 0x0000 > [ 45.885128] Bluetooth: hci0: unexpected event for opcode 0x0000 > [ 45.966218] mwifiex_pcie 0000:01:00.0: WLAN FW is active > [ 46.157474] mwifiex_pcie 0000:01:00.0: info: MWIFIEX VERSION: mwifiex 1.0 (15.68.19.p21) > [ 46.157485] mwifiex_pcie 0000:01:00.0: driver_version = mwifiex 1.0 (15.68.19.p21) Interesting... putting PCIe card into D3cold cause shutdown also of USB bluetooth part. > Note the absence of "WLAN FW already running! Skip FW dnld" on the > second log. Due to the power cycle we're essentially forcing the device > to re-download and re-initialize its firmware. Yea, this proves that putting into D3cold and returning back to D0 cause full reset of the card, both PCIe and USB parts. ARM CPU on this card therefore must be reset and so new full firmware download and start it again is needed. > On an unpatched kernel, > it looks like the firmware itself is kept and state may not be cleared > properly. So in other words it looks like the firmware, while being > prompted to do a reset, doesn't do that properly (at least when it has > crashed before). > > IIRC this then allowed us to recover from firmware issues that the > "normal" firmware reset that mwifiex supposedly performs on a FLR > (or reloading driver modules) didn't help with. PCIe Function Level Reset should reset only one PCIe part of device. And seems that this type of reset does not work properly in some situations. Note that PCIe Function Level Reset is independent of running firmware. It is implement in hardware and (should) work also at early stage when firmware is not loaded yet. I'm starting to think more and more if quirk in this patch really needs to be behind DMI check and if rather it should not be called on other platforms too? > I'm not so sure any more if resetting actively caused issues or if it > just showed different symptoms of some firmware issue that prompted us > to do the reset in the first place (again, been quite a while since I > last dealt with this stuff, sorry). All I know is that this patched > reset gets the card going again. > > As a side note: There are also more patches by Jonas (and Tsuchiya?) > building on top of the quirk implementation introduced here which > significantly reduce the need for doing resets in the first place > (nevertheless having a reset that actually does properly reset the > device is a good thing IMHO). Those patches can for example be found > here: > > https://github.com/linux-surface/kernel/compare/eaaf96ba58a5fe5999b89fe3afaded74caa96480...989c8725a6d4e62db6370dd0fefe45498274d3ce > > > > > > Unfortunately I can't test that with a > > > > > network connection (and without compiling a custom kernel for which I > > > > > don't have the time right now) because there's currently another bug > > > > > deadlocking on device removal if there's an active connection during > > > > > removal (which also seems to trigger on reset). That one ill be fixed > > > > > by > > > > > > > > > > https://lore.kernel.org/linux-wireless/20210515024227.2159311-1-briannorris@chromium.org/ > > > > > > > > > > Jonas might know more. > > > > > > [...]
On 7/9/21 11:25 PM, Pali Rohár wrote: [...] > PCIe Function Level Reset should reset only one PCIe part of device. And > seems that this type of reset does not work properly in some situations. > > Note that PCIe Function Level Reset is independent of running firmware. > It is implement in hardware and (should) work also at early stage when > firmware is not loaded yet. > > I'm starting to think more and more if quirk in this patch really needs > to be behind DMI check and if rather it should not be called on other > platforms too? Maybe? I'm not sure how well this behaves on other devices and if there even are any devices outside of the MS Surface line that really require or benefit from something like this. To me it seems safer to put it behind quirks, at least until we know more. Also not sure if this is just my bias, but it feels like the Surface line always had more problems with that driver (and firmware) than others. I'm honestly a bit surprised that MS stuck with them for this long (they decided to go with Intel for 7th gen devices). AFAICT they initially chose Marvell due to connected standby support, so maybe that causes issue for us and others simply aren't using that feature? Just guessing though.
On Saturday 10 July 2021 00:25:36 Maximilian Luz wrote: > On 7/9/21 11:25 PM, Pali Rohár wrote: > > [...] > > > PCIe Function Level Reset should reset only one PCIe part of device. And > > seems that this type of reset does not work properly in some situations. > > > > Note that PCIe Function Level Reset is independent of running firmware. > > It is implement in hardware and (should) work also at early stage when > > firmware is not loaded yet. > > > > I'm starting to think more and more if quirk in this patch really needs > > to be behind DMI check and if rather it should not be called on other > > platforms too? > > Maybe? I'm not sure how well this behaves on other devices and if there > even are any devices outside of the MS Surface line that really require > or benefit from something like this. To me it seems safer to put it > behind quirks, at least until we know more. This is a good argument for DMI based quirk. But at the end, PCI people should look at this patch and say what do they thing about it, if it is better to enable it only for specific set of machines (Surface) or it is better to enable it for every 88W8xxx wifi or for none. > Also not sure if this is just my bias, but it feels like the Surface > line always had more problems with that driver (and firmware) than > others. Ehm, really? I see reports also from non-Surface users about bad quality of these 88W[89]xxx cards and repeating firmware issues. I have bad personal experience with 88W8997 SDIO firmware and lot of times I get advice about ex-Marvell/NXP wifi cards "do not touch and run away quickly". I think that more people if they get mPCIe/M.2 wifi card in laptop which does not work, they just replace it with some working one. And not spending infinite time in trying to fix it... So this may explain why there are more Surface users with these issues... > I'm honestly a bit surprised that MS stuck with them for this > long (they decided to go with Intel for 7th gen devices). AFAICT they > initially chose Marvell due to connected standby support, so maybe that > causes issue for us and others simply aren't using that feature? Just > guessing though. In my opinion that "Connected Standby" is just MS marketing term. 88W[89]xxx chips using full-mac firmware and drivers [*]. Full-mac lot of times causing more issues than soft-mac solution. Moreover this Marvell firmware implements also other "application" layers in firmware which OS drivers can use, e.g. there is fully working "wpa-supplicant" replacement and also AP part. Maybe windows drivers are using it and it cause less problems? Duno. mwifiex uses only "low level" commands and WPA state machine is implemented in userspace wpa-supplicant daemon. [*] - Small note: There are also soft-mac firmwares and drivers but apparently Marvell has never finished linux driver and firmware was not released to public... And there is also Laird Connectivity which offers their own proprietary linux kernel drivers with their own firmware for these 88W[89]xxx chips. Last time I checked it they released some parts of driver on github. Maybe somebody could contact Laird or check if their driver can be replaced by mwifiex? Or just replacing ex-Marvell/NXP firmware by their? But I'm not sure if they have something for 88W8897.
On 7/10/21 12:54 AM, Pali Rohár wrote: [...] >> Also not sure if this is just my bias, but it feels like the Surface >> line always had more problems with that driver (and firmware) than >> others. > > Ehm, really? I see reports also from non-Surface users about bad quality > of these 88W[89]xxx cards and repeating firmware issues. I have bad > personal experience with 88W8997 SDIO firmware and lot of times I get > advice about ex-Marvell/NXP wifi cards "do not touch and run away > quickly". Yeah, then I'm probably biased since I'm mostly dealing with Surface stuff. > I think that more people if they get mPCIe/M.2 wifi card in laptop which > does not work, they just replace it with some working one. And not > spending infinite time in trying to fix it... So this may explain why > there are more Surface users with these issues... That might be an explanation. If it wouldn't need a heat-gun to open it up, I'd probably have done that at some point in the past (there were times when WiFi at my Uni was pretty much unusable with this device... and I'm still not sure what fixed that or even if it's fixed completely). >> I'm honestly a bit surprised that MS stuck with them for this >> long (they decided to go with Intel for 7th gen devices). AFAICT they >> initially chose Marvell due to connected standby support, so maybe that >> causes issue for us and others simply aren't using that feature? Just >> guessing though. > > In my opinion that "Connected Standby" is just MS marketing term. I can only really repeat what I've been told: Apparently when they started designing those devices, the only option with "Connected standby" (or probably rather that feature set that MS wanted) was, unfortunately for us, Marvell. > 88W[89]xxx chips using full-mac firmware and drivers [*]. Full-mac lot > of times causing more issues than soft-mac solution. Moreover this > Marvell firmware implements also other "application" layers in firmware > which OS drivers can use, e.g. there is fully working "wpa-supplicant" > replacement and also AP part. Maybe windows drivers are using it and it > cause less problems? Duno. mwifiex uses only "low level" commands and > WPA state machine is implemented in userspace wpa-supplicant daemon. > > [*] - Small note: There are also soft-mac firmwares and drivers but > apparently Marvell has never finished linux driver and firmware was not > released to public... > > And there is also Laird Connectivity which offers their own proprietary > linux kernel drivers with their own firmware for these 88W[89]xxx chips. > Last time I checked it they released some parts of driver on github. > Maybe somebody could contact Laird or check if their driver can be > replaced by mwifiex? Or just replacing ex-Marvell/NXP firmware by their? > But I'm not sure if they have something for 88W8897. Interesting, I was not aware of this. IIRC we've been experimenting with the mwlwifi driver (which that lrdmwl driver seems to be based on?), but couldn't get that to work with the firmware we have. IIRC it also didn't work with the Windows firmware (which seems to be significantly different from the one we have for Linux and seems to use or be modeled after some special Windows WiFi driver interface).
On Saturday 10 July 2021 02:00:08 Maximilian Luz wrote: > On 7/10/21 12:54 AM, Pali Rohár wrote: > > [...] > > > > Also not sure if this is just my bias, but it feels like the Surface > > > line always had more problems with that driver (and firmware) than > > > others. > > > > Ehm, really? I see reports also from non-Surface users about bad quality > > of these 88W[89]xxx cards and repeating firmware issues. I have bad > > personal experience with 88W8997 SDIO firmware and lot of times I get > > advice about ex-Marvell/NXP wifi cards "do not touch and run away > > quickly". > > Yeah, then I'm probably biased since I'm mostly dealing with Surface > stuff. > > > I think that more people if they get mPCIe/M.2 wifi card in laptop which > > does not work, they just replace it with some working one. And not > > spending infinite time in trying to fix it... So this may explain why > > there are more Surface users with these issues... > > That might be an explanation. If it wouldn't need a heat-gun to open it > up, I'd probably have done that at some point in the past (there were > times when WiFi at my Uni was pretty much unusable with this device... > and I'm still not sure what fixed that or even if it's fixed completely). > > > > I'm honestly a bit surprised that MS stuck with them for this > > > long (they decided to go with Intel for 7th gen devices). AFAICT they > > > initially chose Marvell due to connected standby support, so maybe that > > > causes issue for us and others simply aren't using that feature? Just > > > guessing though. > > > > In my opinion that "Connected Standby" is just MS marketing term. > > I can only really repeat what I've been told: Apparently when they > started designing those devices, the only option with "Connected > standby" (or probably rather that feature set that MS wanted) was, > unfortunately for us, Marvell. > > > 88W[89]xxx chips using full-mac firmware and drivers [*]. Full-mac lot > > of times causing more issues than soft-mac solution. Moreover this > > Marvell firmware implements also other "application" layers in firmware > > which OS drivers can use, e.g. there is fully working "wpa-supplicant" > > replacement and also AP part. Maybe windows drivers are using it and it > > cause less problems? Duno. mwifiex uses only "low level" commands and > > WPA state machine is implemented in userspace wpa-supplicant daemon. > > > > [*] - Small note: There are also soft-mac firmwares and drivers but > > apparently Marvell has never finished linux driver and firmware was not > > released to public... > > > > And there is also Laird Connectivity which offers their own proprietary > > linux kernel drivers with their own firmware for these 88W[89]xxx chips. > > Last time I checked it they released some parts of driver on github. > > Maybe somebody could contact Laird or check if their driver can be > > replaced by mwifiex? Or just replacing ex-Marvell/NXP firmware by their? > > But I'm not sure if they have something for 88W8897. > > Interesting, I was not aware of this. IIRC we've been experimenting with > the mwlwifi driver (which that lrdmwl driver seems to be based on?), but > couldn't get that to work with the firmware we have. mwlwifi is that soft-mac driver and uses completely different firmware. For sure it would not work with current full-mac firmware. > IIRC it also didn't > work with the Windows firmware (which seems to be significantly > different from the one we have for Linux and seems to use or be modeled > after some special Windows WiFi driver interface). So... Microsoft has different firmware for this chip? And it is working with mwifiex driver?
On 7/10/21 2:07 AM, Pali Rohár wrote: [...] >> Interesting, I was not aware of this. IIRC we've been experimenting with >> the mwlwifi driver (which that lrdmwl driver seems to be based on?), but >> couldn't get that to work with the firmware we have. > > mwlwifi is that soft-mac driver and uses completely different firmware. > For sure it would not work with current full-mac firmware. > >> IIRC it also didn't >> work with the Windows firmware (which seems to be significantly >> different from the one we have for Linux and seems to use or be modeled >> after some special Windows WiFi driver interface). > > So... Microsoft has different firmware for this chip? And it is working > with mwifiex driver? I'm not sure how special that firmware really is (i.e. if it is Surface specific or just what Marvell uses on Windows), only that it doesn't look like the firmware included in the linux-firmware repo. The Windows firmware doesn't work with either mwlwifi or mwifiex drivers (IIRC) and on Linux we use the official firmware from the linux-firmware repo.
On Saturday 10 July 2021 02:18:12 Maximilian Luz wrote: > On 7/10/21 2:07 AM, Pali Rohár wrote: > > [...] > > > > Interesting, I was not aware of this. IIRC we've been experimenting with > > > the mwlwifi driver (which that lrdmwl driver seems to be based on?), but > > > couldn't get that to work with the firmware we have. > > > > mwlwifi is that soft-mac driver and uses completely different firmware. > > For sure it would not work with current full-mac firmware. > > > > > IIRC it also didn't > > > work with the Windows firmware (which seems to be significantly > > > different from the one we have for Linux and seems to use or be modeled > > > after some special Windows WiFi driver interface). > > > > So... Microsoft has different firmware for this chip? And it is working > > with mwifiex driver? > > I'm not sure how special that firmware really is (i.e. if it is Surface > specific or just what Marvell uses on Windows), only that it doesn't > look like the firmware included in the linux-firmware repo. The Windows > firmware doesn't work with either mwlwifi or mwifiex drivers (IIRC) and > on Linux we use the official firmware from the linux-firmware repo. Version available in the linux-firmware repo is also what big companies (like google) receive for their systems... sometimes just only older version as Marvell/NXP is slow in updating files in linux-firmware. Seems that it is also same what receive customers under NDA as more companies dropped "proprietary" ex-Marvell/NXP driver on internet and it contained this firmware with some sources of driver which looks like a fork of mwifiex (or maybe mwifiex is "cleaned fork" of that driver :D) There is old firmware documentation which describe RPC communication between OS and firmware: http://wiki.laptop.org/images/f/f3/Firmware-Spec-v5.1-MV-S103752-00.pdf It is really old for very old wifi chips and when I checked it, it still matches what mwifiex is doing with new chips. Just there are new and more commands. And documentation is OS-neutral. So if Microsoft has some "incompatible" firmware with this, it could mean that they got something special which nobody else have? Maybe it can explain that "connected standby" and maybe also better stability? Or just windows distribute firmware in different format and needs to "unpack" or "preprocess" prior downloading it to device?
On 7/10/21 2:38 AM, Pali Rohár wrote: > On Saturday 10 July 2021 02:18:12 Maximilian Luz wrote: >> On 7/10/21 2:07 AM, Pali Rohár wrote: >> >> [...] >> >>>> Interesting, I was not aware of this. IIRC we've been experimenting with >>>> the mwlwifi driver (which that lrdmwl driver seems to be based on?), but >>>> couldn't get that to work with the firmware we have. >>> >>> mwlwifi is that soft-mac driver and uses completely different firmware. >>> For sure it would not work with current full-mac firmware. >>> >>>> IIRC it also didn't >>>> work with the Windows firmware (which seems to be significantly >>>> different from the one we have for Linux and seems to use or be modeled >>>> after some special Windows WiFi driver interface). >>> >>> So... Microsoft has different firmware for this chip? And it is working >>> with mwifiex driver? >> >> I'm not sure how special that firmware really is (i.e. if it is Surface >> specific or just what Marvell uses on Windows), only that it doesn't >> look like the firmware included in the linux-firmware repo. The Windows >> firmware doesn't work with either mwlwifi or mwifiex drivers (IIRC) and >> on Linux we use the official firmware from the linux-firmware repo. > > Version available in the linux-firmware repo is also what big companies > (like google) receive for their systems... sometimes just only older > version as Marvell/NXP is slow in updating files in linux-firmware. > Seems that it is also same what receive customers under NDA as more > companies dropped "proprietary" ex-Marvell/NXP driver on internet and it > contained this firmware with some sources of driver which looks like a > fork of mwifiex (or maybe mwifiex is "cleaned fork" of that driver :D) > > There is old firmware documentation which describe RPC communication > between OS and firmware: > http://wiki.laptop.org/images/f/f3/Firmware-Spec-v5.1-MV-S103752-00.pdf > > It is really old for very old wifi chips and when I checked it, it still > matches what mwifiex is doing with new chips. Just there are new and > more commands. And documentation is OS-neutral. > > So if Microsoft has some "incompatible" firmware with this, it could > mean that they got something special which nobody else have? Maybe it > can explain that "connected standby" and maybe also better stability? > > Or just windows distribute firmware in different format and needs to > "unpack" or "preprocess" prior downloading it to device? If memory serves me right, Jonas did some reverse engineering on the Windows driver and found that it uses the "new" WDI Miniport API: It seems that originally both Windows and Linux drivers (and firmware) were pretty much the same (he mentioned there were similarities in terminology), but then they switched to that new API on Windows and changed the firmware with it, so that the driver now essentially only forwards the commands from that API to the firmware and the firmware handles the rest. By reading the Windows docs on that API, that change might have been forced on them as some Windows 10 features apparently only work via that API. He'll probably know more about that than I do.
On 7/9/21 7:30 PM, Pali Rohár wrote: > On Friday 09 July 2021 19:03:37 Maximilian Luz wrote: >> On 7/9/21 6:12 PM, Pali Rohár wrote: >> >> [...] >> >>>>> Hello! Now I'm thinking loudly about this patch. Why this kind of reset >>>>> is needed only for Surface devices? AFAIK these 88W8897 chips are same >>>>> in all cards. Chip itself implements PCIe interface (and also SDIO) so >>>>> for me looks very strange if this 88W8897 PCIe device needs DMI specific >>>>> quirks. I cannot believe that Microsoft got some special version of >>>>> these chips from Marvell which are different than version uses on cards >>>>> in mPCIe form factor. >>>>> >>>>> And now when I'm reading comment below about PCIe bridge to which is >>>>> this 88W8897 PCIe chip connected, is not this rather an issue in that >>>>> PCIe bridge (instead of mwifiex/88W8897) or in ACPI firmware which >>>>> controls this bridge? >>>>> >>>>> Or are having other people same issues on mPCIe form factor wifi cards >>>>> with 88W8897 chips and then this quirk should not DMI dependent? >>>>> >>>>> Note that I'm seeing issues with reset and other things also on chip >>>>> 88W8997 when is connected to system via SDIO. These chips have both PCIe >>>>> and SDIO buses, it just depends which pins are used. >>>>> >>>> >>>> Hi and thanks for the quick reply! Honestly I've no idea, this is just the >>>> first method we found that allows for a proper reset of the chip. What I >>>> know is that some Surface devices need that ACPI DSM call (the one that was >>>> done in the commit I dropped in this version of the patchset) to reset the >>>> chip instead of this method. >>>> >>>> Afaik other devices with this chip don't need this resetting method, at >>>> least Marvell employees couldn't reproduce the issues on their testing >>>> devices. >>>> >>>> So would you suggest we just try to match for the pci chip 88W8897 instead? >>> >>> Hello! Such suggestion makes sense when we know that it is 88W8897 >>> issue. But if you got information that issue cannot be reproduced on >>> other 88W8897 cards then matching 88W8897 is not correct. >>> >>> From all this information looks like that it is problem in (Microsoft?) >>> PCIe bridge to which is card connected. Otherwise I do not reason how it >>> can be 88W8897 affected. Either it is reproducible on 88W8897 cards also >>> in other devices or issue is not on 88W8897 card. >> >> I doubt that it's an issue with the PCIe bridge (itself at least). The >> same type of bridge is used for both dGPU and NVME SSD on my device (see >> lspci output below) and those work fine. Also if I'm seeing that right >> it's from the Intel CPU, so my guess is that a lot more people would >> have issues with that then. > > From information below it seems to be related to surprise removal. > Therefore is surprise removal working without issue for dGPU or NVME > SSD? Not all PCIe bridges support surprise removal... > >> >> I don't know about the hardware side, so it might be possible that it's >> an issue with integrating both bridge and wifi chip, in which case it's >> still probably best handled via DMI quirks unless we know more. >> >> Also as Tsuchiya mentioned in his original submission, on Windows the >> device is reset via this D3cold method. I've only skimmed that >> errata.inf file mentioned, but I think this is what he's referring to: >> >> Controls whether ACPIDeviceEnableD3ColdOnSurpriseRemoval rule will be >> evaluated or not on a given platform. Currently >> ACPIDeviceEnableD3ColdOnSurpriseRemoval rule only needs to be >> evaluated on Surface platforms which contain the Marvell WiFi >> controller which depends on device going through D3Cold as part of >> surprise-removal. >> >> and >> >> Starting with Windows releases *after* Blue, ACPI will not put >> surprise-removed devices into D3Cold automatically. Some known >> scenarios (viz. WiFi reset/recovery) rely on the device cycling >> through D3Cold on surprise-removal. This hack allows surprise-removed >> devices to be put into D3Cold (if supported by the stack). >> >> So, as far as I can tell, the chip doesn't like to be surprise-removed >> (which seems to happen during reset) and then needs to be power-cycled, >> which I think is likely due to some issue with firmware state. > > Thanks for information. This really does not look like PCIe bridge > specific if bridge itself can handle surprise-removed devices. lspci can > tell us if bridge supports it or not (see below). > >> So the quirk on Windows seems very Surface specific. >> >> There also seem a bunch of revisions of these chips around, for example >> my SB2 is affected by a bug that we've tied to the specific hardware >> revision which causes some issues with host-sleep (IIRC chip switches >> rapidly between wake and sleep states without any external influence, >> which is not how it should behave and how it does behave on a later >> hardware revision). > > Interesting... This looks like the issue can be in 88W8897 chip and > needs some special conditions to trigger? And Surface is triggering it > always? The specific issue was that the card wakes up very quickly after going into deep sleep (deep sleep is a state the firmware enters when idle and not connected to an AP). Now this in turn messed with the host suspending, because deep sleep is involved there and the card is not expected to wake up that quickly again (I'm oversimplifying here, it's also been some time since we looked into it). Anyway, in the end those wakeups from deep sleep only happened with one hardware revision of the card (I guess it's caused by a hardware design issue like a floating gpio or something), and we managed to fix the problems by disabling deep sleep on that hardware revision, but as Max mentioned that problem is completely unrelated from this patch. > >>>> Then we'd probably have to check if there are any laptops where multiple >>>> devices are connected to the pci bridge as Amey suggested in a review >>>> before. >>> >>> Well, I do not know... But if this is issue with PCIe bridge then >>> similar issue could be observed also for other PCIe devices with this >>> PCIe bridge. But question is if there are other laptops with this PCIe >>> bridge. And also it can be a problem in ACPI firmware on those Surface >>> devices, which implements some PCIe bridge functionality. So it is >>> possible that issue is with PCIe bridge, not in HW, but in SW/firmware >>> part which can be Microsoft specific... So too many questions to which >>> we do not know answers. >>> >>> Could you provide output of 'lspci -nn -vv' and 'lspci -tvnn' on >>> affected machines? If you have already sent it in some previous email, >>> just send a link. At least I'm not able to find it right now and output >>> may contain something useful... >> >> From my Surface Book 2 (with the same issue): >> >> - lspci -tvnn: https://paste.ubuntu.com/p/mm3YpcZJ8N/ >> - lspci -vv -nn: https://paste.ubuntu.com/p/dctTDP738N/ > > Could you re-run lspci under root account? There are missing important > parts like "Capabilities: <access denied>" where is information if > bridge supports surprise removal or not. > >> Regards, >> Max
On 7/10/21 3:07 AM, Maximilian Luz wrote: > On 7/10/21 2:38 AM, Pali Rohár wrote: >> On Saturday 10 July 2021 02:18:12 Maximilian Luz wrote: >>> On 7/10/21 2:07 AM, Pali Rohár wrote: >>> >>> [...] >>> >>>>> Interesting, I was not aware of this. IIRC we've been experimenting >>>>> with >>>>> the mwlwifi driver (which that lrdmwl driver seems to be based >>>>> on?), but >>>>> couldn't get that to work with the firmware we have. >>>> >>>> mwlwifi is that soft-mac driver and uses completely different firmware. >>>> For sure it would not work with current full-mac firmware. >>>> >>>>> IIRC it also didn't >>>>> work with the Windows firmware (which seems to be significantly >>>>> different from the one we have for Linux and seems to use or be >>>>> modeled >>>>> after some special Windows WiFi driver interface). >>>> >>>> So... Microsoft has different firmware for this chip? And it is working >>>> with mwifiex driver? >>> >>> I'm not sure how special that firmware really is (i.e. if it is Surface >>> specific or just what Marvell uses on Windows), only that it doesn't >>> look like the firmware included in the linux-firmware repo. The Windows >>> firmware doesn't work with either mwlwifi or mwifiex drivers (IIRC) and >>> on Linux we use the official firmware from the linux-firmware repo. >> >> Version available in the linux-firmware repo is also what big companies >> (like google) receive for their systems... sometimes just only older >> version as Marvell/NXP is slow in updating files in linux-firmware. >> Seems that it is also same what receive customers under NDA as more >> companies dropped "proprietary" ex-Marvell/NXP driver on internet and it >> contained this firmware with some sources of driver which looks like a >> fork of mwifiex (or maybe mwifiex is "cleaned fork" of that driver :D) >> >> There is old firmware documentation which describe RPC communication >> between OS and firmware: >> http://wiki.laptop.org/images/f/f3/Firmware-Spec-v5.1-MV-S103752-00.pdf >> >> It is really old for very old wifi chips and when I checked it, it still >> matches what mwifiex is doing with new chips. Just there are new and >> more commands. And documentation is OS-neutral. >> >> So if Microsoft has some "incompatible" firmware with this, it could >> mean that they got something special which nobody else have? Maybe it >> can explain that "connected standby" and maybe also better stability? >> >> Or just windows distribute firmware in different format and needs to >> "unpack" or "preprocess" prior downloading it to device? > > If memory serves me right, Jonas did some reverse engineering on the > Windows driver and found that it uses the "new" WDI Miniport API: It > seems that originally both Windows and Linux drivers (and firmware) > were pretty much the same (he mentioned there were similarities in > terminology), but then they switched to that new API on Windows and > changed the firmware with it, so that the driver now essentially only > forwards the commands from that API to the firmware and the firmware > handles the rest. > > By reading the Windows docs on that API, that change might have been > forced on them as some Windows 10 features apparently only work via > that API. > > He'll probably know more about that than I do. Not much I can add there, it seemed a lot like both mwifiex and the Windows 10 WDI miniport driver were both derived from the same codebase originally, but in order to be compatible with the WDI miniport API and other stuff Windows requires from wifi devices (I recall there was some SAR-value control/reporting stuff too), some parts of the firmware had to be rewritten. In the end, the Windows firmware is updated a lot more often and likely includes a bunch of bugfixes the linux firmware doesn't have, but it can't be used on linux without a ton of work that would probably include rebuilding proprietary APIs from Windows. Also, from my testing with custom APs and sniffing packets with Wireshark, the functionality, limitations and weird "semi-spec-compliant" behaviors were exactly the same with the Windows firmware: It doesn't support WPA3, it can't connect to fast transition APs (funnily enough that's opposed to what MS claims) and it also can't spawn an AP with WPA-PSK-SHA256 AKM ciphers. So not sure there's a lot of sense in spending more time trying to go down that path.
On Sunday 11 July 2021 18:53:32 Jonas Dreßler wrote: > On 7/10/21 3:07 AM, Maximilian Luz wrote: > > On 7/10/21 2:38 AM, Pali Rohár wrote: > > > On Saturday 10 July 2021 02:18:12 Maximilian Luz wrote: > > > > On 7/10/21 2:07 AM, Pali Rohár wrote: > > > > > > > > [...] > > > > > > > > > > Interesting, I was not aware of this. IIRC we've been > > > > > > experimenting with > > > > > > the mwlwifi driver (which that lrdmwl driver seems to be > > > > > > based on?), but > > > > > > couldn't get that to work with the firmware we have. > > > > > > > > > > mwlwifi is that soft-mac driver and uses completely different firmware. > > > > > For sure it would not work with current full-mac firmware. > > > > > > > > > > > IIRC it also didn't > > > > > > work with the Windows firmware (which seems to be significantly > > > > > > different from the one we have for Linux and seems to > > > > > > use or be modeled > > > > > > after some special Windows WiFi driver interface). > > > > > > > > > > So... Microsoft has different firmware for this chip? And it is working > > > > > with mwifiex driver? > > > > > > > > I'm not sure how special that firmware really is (i.e. if it is Surface > > > > specific or just what Marvell uses on Windows), only that it doesn't > > > > look like the firmware included in the linux-firmware repo. The Windows > > > > firmware doesn't work with either mwlwifi or mwifiex drivers (IIRC) and > > > > on Linux we use the official firmware from the linux-firmware repo. > > > > > > Version available in the linux-firmware repo is also what big companies > > > (like google) receive for their systems... sometimes just only older > > > version as Marvell/NXP is slow in updating files in linux-firmware. > > > Seems that it is also same what receive customers under NDA as more > > > companies dropped "proprietary" ex-Marvell/NXP driver on internet and it > > > contained this firmware with some sources of driver which looks like a > > > fork of mwifiex (or maybe mwifiex is "cleaned fork" of that driver :D) > > > > > > There is old firmware documentation which describe RPC communication > > > between OS and firmware: > > > http://wiki.laptop.org/images/f/f3/Firmware-Spec-v5.1-MV-S103752-00.pdf > > > > > > It is really old for very old wifi chips and when I checked it, it still > > > matches what mwifiex is doing with new chips. Just there are new and > > > more commands. And documentation is OS-neutral. > > > > > > So if Microsoft has some "incompatible" firmware with this, it could > > > mean that they got something special which nobody else have? Maybe it > > > can explain that "connected standby" and maybe also better stability? > > > > > > Or just windows distribute firmware in different format and needs to > > > "unpack" or "preprocess" prior downloading it to device? > > > > If memory serves me right, Jonas did some reverse engineering on the > > Windows driver and found that it uses the "new" WDI Miniport API: It > > seems that originally both Windows and Linux drivers (and firmware) > > were pretty much the same (he mentioned there were similarities in > > terminology), but then they switched to that new API on Windows and > > changed the firmware with it, so that the driver now essentially only > > forwards the commands from that API to the firmware and the firmware > > handles the rest. > > > > By reading the Windows docs on that API, that change might have been > > forced on them as some Windows 10 features apparently only work via > > that API. > > > > He'll probably know more about that than I do. > > Not much I can add there, it seemed a lot like both mwifiex and the Windows > 10 WDI miniport driver were both derived from the same codebase originally, > but in order to be compatible with the WDI miniport API and other stuff > Windows requires from wifi devices (I recall there was some SAR-value > control/reporting stuff too), some parts of the firmware had to be > rewritten. > > In the end, the Windows firmware is updated a lot more often and likely > includes a bunch of bugfixes the linux firmware doesn't have, but it can't > be used on linux without a ton of work that would probably include > rebuilding proprietary APIs from Windows. > > Also, from my testing with custom APs and sniffing packets with Wireshark, > the functionality, limitations and weird "semi-spec-compliant" behaviors > were exactly the same with the Windows firmware: It doesn't support WPA3, it > can't connect to fast transition APs (funnily enough that's opposed to what > MS claims) and it also can't spawn an AP with WPA-PSK-SHA256 AKM ciphers. So > not sure there's a lot of sense in spending more time trying to go down that > path. New version of firmware files are available on NXP portal, where are updated more frequently, but only for companies which have NXP accounts and signed NDA with NXP. Not for end users. If you want these new firmware files, you need to ask NXP developers as only they can ask for non-NDA distribution and include new version into linux-firmware repository. Like in this pull request where is new SDIO firmware for 88W8897: https://lore.kernel.org/linux-firmware/DB7PR04MB453855B0D6C41923BCB0922EFC1C9@DB7PR04MB4538.eurprd04.prod.outlook.com/
diff --git a/drivers/net/wireless/marvell/mwifiex/pcie.c b/drivers/net/wireless/marvell/mwifiex/pcie.c index a530832c9421..c6ccce426b49 100644 --- a/drivers/net/wireless/marvell/mwifiex/pcie.c +++ b/drivers/net/wireless/marvell/mwifiex/pcie.c @@ -528,6 +528,13 @@ static void mwifiex_pcie_reset_prepare(struct pci_dev *pdev) mwifiex_shutdown_sw(adapter); clear_bit(MWIFIEX_IFACE_WORK_DEVICE_DUMP, &card->work_flags); clear_bit(MWIFIEX_IFACE_WORK_CARD_RESET, &card->work_flags); + + /* On MS Surface gen4+ devices FLR isn't effective to recover from + * hangups, so we power-cycle the card instead. + */ + if (card->quirks & QUIRK_FW_RST_D3COLD) + mwifiex_pcie_reset_d3cold_quirk(pdev); + mwifiex_dbg(adapter, INFO, "%s, successful\n", __func__); card->pci_reset_ongoing = true; diff --git a/drivers/net/wireless/marvell/mwifiex/pcie_quirks.c b/drivers/net/wireless/marvell/mwifiex/pcie_quirks.c index 4064f99b36ba..b5f214fc1212 100644 --- a/drivers/net/wireless/marvell/mwifiex/pcie_quirks.c +++ b/drivers/net/wireless/marvell/mwifiex/pcie_quirks.c @@ -15,6 +15,72 @@ /* quirk table based on DMI matching */ static const struct dmi_system_id mwifiex_quirk_table[] = { + { + .ident = "Surface Pro 4", + .matches = { + DMI_EXACT_MATCH(DMI_SYS_VENDOR, "Microsoft Corporation"), + DMI_EXACT_MATCH(DMI_PRODUCT_NAME, "Surface Pro 4"), + }, + .driver_data = (void *)QUIRK_FW_RST_D3COLD, + }, + { + .ident = "Surface Pro 5", + .matches = { + /* match for SKU here due to generic product name "Surface Pro" */ + DMI_EXACT_MATCH(DMI_SYS_VENDOR, "Microsoft Corporation"), + DMI_EXACT_MATCH(DMI_PRODUCT_SKU, "Surface_Pro_1796"), + }, + .driver_data = (void *)QUIRK_FW_RST_D3COLD, + }, + { + .ident = "Surface Pro 5 (LTE)", + .matches = { + /* match for SKU here due to generic product name "Surface Pro" */ + DMI_EXACT_MATCH(DMI_SYS_VENDOR, "Microsoft Corporation"), + DMI_EXACT_MATCH(DMI_PRODUCT_SKU, "Surface_Pro_1807"), + }, + .driver_data = (void *)QUIRK_FW_RST_D3COLD, + }, + { + .ident = "Surface Pro 6", + .matches = { + DMI_EXACT_MATCH(DMI_SYS_VENDOR, "Microsoft Corporation"), + DMI_EXACT_MATCH(DMI_PRODUCT_NAME, "Surface Pro 6"), + }, + .driver_data = (void *)QUIRK_FW_RST_D3COLD, + }, + { + .ident = "Surface Book 1", + .matches = { + DMI_EXACT_MATCH(DMI_SYS_VENDOR, "Microsoft Corporation"), + DMI_EXACT_MATCH(DMI_PRODUCT_NAME, "Surface Book"), + }, + .driver_data = (void *)QUIRK_FW_RST_D3COLD, + }, + { + .ident = "Surface Book 2", + .matches = { + DMI_EXACT_MATCH(DMI_SYS_VENDOR, "Microsoft Corporation"), + DMI_EXACT_MATCH(DMI_PRODUCT_NAME, "Surface Book 2"), + }, + .driver_data = (void *)QUIRK_FW_RST_D3COLD, + }, + { + .ident = "Surface Laptop 1", + .matches = { + DMI_EXACT_MATCH(DMI_SYS_VENDOR, "Microsoft Corporation"), + DMI_EXACT_MATCH(DMI_PRODUCT_NAME, "Surface Laptop"), + }, + .driver_data = (void *)QUIRK_FW_RST_D3COLD, + }, + { + .ident = "Surface Laptop 2", + .matches = { + DMI_EXACT_MATCH(DMI_SYS_VENDOR, "Microsoft Corporation"), + DMI_EXACT_MATCH(DMI_PRODUCT_NAME, "Surface Laptop 2"), + }, + .driver_data = (void *)QUIRK_FW_RST_D3COLD, + }, {} }; @@ -29,4 +95,61 @@ void mwifiex_initialize_quirks(struct pcie_service_card *card) if (!card->quirks) dev_info(&pdev->dev, "no quirks enabled\n"); + if (card->quirks & QUIRK_FW_RST_D3COLD) + dev_info(&pdev->dev, "quirk reset_d3cold enabled\n"); +} + +static void mwifiex_pcie_set_power_d3cold(struct pci_dev *pdev) +{ + dev_info(&pdev->dev, "putting into D3cold...\n"); + + pci_save_state(pdev); + if (pci_is_enabled(pdev)) + pci_disable_device(pdev); + pci_set_power_state(pdev, PCI_D3cold); +} + +static int mwifiex_pcie_set_power_d0(struct pci_dev *pdev) +{ + int ret; + + dev_info(&pdev->dev, "putting into D0...\n"); + + pci_set_power_state(pdev, PCI_D0); + ret = pci_enable_device(pdev); + if (ret) { + dev_err(&pdev->dev, "pci_enable_device failed\n"); + return ret; + } + pci_restore_state(pdev); + + return 0; +} + +int mwifiex_pcie_reset_d3cold_quirk(struct pci_dev *pdev) +{ + struct pci_dev *parent_pdev = pci_upstream_bridge(pdev); + int ret; + + /* Power-cycle (put into D3cold then D0) */ + dev_info(&pdev->dev, "Using reset_d3cold quirk to perform FW reset\n"); + + /* We need to perform power-cycle also for bridge of wifi because + * on some devices (e.g. Surface Book 1), the OS for some reasons + * can't know the real power state of the bridge. + * When tried to power-cycle only wifi, the reset failed with the + * following dmesg log: + * "Cannot transition to power state D0 for parent in D3hot". + */ + mwifiex_pcie_set_power_d3cold(pdev); + mwifiex_pcie_set_power_d3cold(parent_pdev); + + ret = mwifiex_pcie_set_power_d0(parent_pdev); + if (ret) + return ret; + ret = mwifiex_pcie_set_power_d0(pdev); + if (ret) + return ret; + + return 0; } diff --git a/drivers/net/wireless/marvell/mwifiex/pcie_quirks.h b/drivers/net/wireless/marvell/mwifiex/pcie_quirks.h index 7a1fe3b3a61a..549093067813 100644 --- a/drivers/net/wireless/marvell/mwifiex/pcie_quirks.h +++ b/drivers/net/wireless/marvell/mwifiex/pcie_quirks.h @@ -5,4 +5,7 @@ #include "pcie.h" +#define QUIRK_FW_RST_D3COLD BIT(0) + void mwifiex_initialize_quirks(struct pcie_service_card *card); +int mwifiex_pcie_reset_d3cold_quirk(struct pci_dev *pdev);