diff mbox series

[v2,2/2] mwifiex: pcie: add reset_d3cold quirk for Surface gen4+ devices

Message ID 20210709145831.6123-3-verdre@v0yd.nl (mailing list archive)
State Changes Requested
Delegated to: Kalle Valo
Headers show
Series mwifiex: Add quirks for MS Surface devices | expand

Commit Message

Jonas Dreßler July 9, 2021, 2:58 p.m. UTC
From: Tsuchiya Yuto <kitakar@gmail.com>

To reset mwifiex on Surface gen4+ (Pro 4 or later gen) devices, it
seems that putting the wifi device into D3cold is required according
to errata.inf file on Windows installation (Windows/INF/errata.inf).

This patch adds a function that performs power-cycle (put into D3cold
then D0) and call the function at the end of reset_prepare().

Note: Need to also reset the parent device (bridge) of wifi on SB1;
it might be because the bridge of wifi always reports it's in D3hot.
When I tried to reset only the wifi device (not touching parent), it gave
the following error and the reset failed:

    acpi device:4b: Cannot transition to power state D0 for parent in D3hot
    mwifiex_pcie 0000:03:00.0: can't change power state from D3cold to D0 (config space inaccessible)

Signed-off-by: Tsuchiya Yuto <kitakar@gmail.com>
Signed-off-by: Jonas Dreßler <verdre@v0yd.nl>
---
 drivers/net/wireless/marvell/mwifiex/pcie.c   |   7 +
 .../wireless/marvell/mwifiex/pcie_quirks.c    | 123 ++++++++++++++++++
 .../wireless/marvell/mwifiex/pcie_quirks.h    |   3 +
 3 files changed, 133 insertions(+)

Comments

Pali Rohár July 9, 2021, 3:18 p.m. UTC | #1
On Friday 09 July 2021 16:58:31 Jonas Dreßler wrote:
> From: Tsuchiya Yuto <kitakar@gmail.com>
> 
> To reset mwifiex on Surface gen4+ (Pro 4 or later gen) devices, it
> seems that putting the wifi device into D3cold is required according
> to errata.inf file on Windows installation (Windows/INF/errata.inf).
> 
> This patch adds a function that performs power-cycle (put into D3cold
> then D0) and call the function at the end of reset_prepare().
> 
> Note: Need to also reset the parent device (bridge) of wifi on SB1;
> it might be because the bridge of wifi always reports it's in D3hot.
> When I tried to reset only the wifi device (not touching parent), it gave
> the following error and the reset failed:
> 
>     acpi device:4b: Cannot transition to power state D0 for parent in D3hot
>     mwifiex_pcie 0000:03:00.0: can't change power state from D3cold to D0 (config space inaccessible)
> 
> Signed-off-by: Tsuchiya Yuto <kitakar@gmail.com>
> Signed-off-by: Jonas Dreßler <verdre@v0yd.nl>
> ---
>  drivers/net/wireless/marvell/mwifiex/pcie.c   |   7 +
>  .../wireless/marvell/mwifiex/pcie_quirks.c    | 123 ++++++++++++++++++
>  .../wireless/marvell/mwifiex/pcie_quirks.h    |   3 +
>  3 files changed, 133 insertions(+)
> 
> diff --git a/drivers/net/wireless/marvell/mwifiex/pcie.c b/drivers/net/wireless/marvell/mwifiex/pcie.c
> index a530832c9421..c6ccce426b49 100644
> --- a/drivers/net/wireless/marvell/mwifiex/pcie.c
> +++ b/drivers/net/wireless/marvell/mwifiex/pcie.c
> @@ -528,6 +528,13 @@ static void mwifiex_pcie_reset_prepare(struct pci_dev *pdev)
>  	mwifiex_shutdown_sw(adapter);
>  	clear_bit(MWIFIEX_IFACE_WORK_DEVICE_DUMP, &card->work_flags);
>  	clear_bit(MWIFIEX_IFACE_WORK_CARD_RESET, &card->work_flags);
> +
> +	/* On MS Surface gen4+ devices FLR isn't effective to recover from
> +	 * hangups, so we power-cycle the card instead.
> +	 */
> +	if (card->quirks & QUIRK_FW_RST_D3COLD)
> +		mwifiex_pcie_reset_d3cold_quirk(pdev);
> +

Hello! Now I'm thinking loudly about this patch. Why this kind of reset
is needed only for Surface devices? AFAIK these 88W8897 chips are same
in all cards. Chip itself implements PCIe interface (and also SDIO) so
for me looks very strange if this 88W8897 PCIe device needs DMI specific
quirks. I cannot believe that Microsoft got some special version of
these chips from Marvell which are different than version uses on cards
in mPCIe form factor.

And now when I'm reading comment below about PCIe bridge to which is
this 88W8897 PCIe chip connected, is not this rather an issue in that
PCIe bridge (instead of mwifiex/88W8897) or in ACPI firmware which
controls this bridge?

Or are having other people same issues on mPCIe form factor wifi cards
with 88W8897 chips and then this quirk should not DMI dependent?

Note that I'm seeing issues with reset and other things also on chip
88W8997 when is connected to system via SDIO. These chips have both PCIe
and SDIO buses, it just depends which pins are used.

>  	mwifiex_dbg(adapter, INFO, "%s, successful\n", __func__);
>  
>  	card->pci_reset_ongoing = true;
> diff --git a/drivers/net/wireless/marvell/mwifiex/pcie_quirks.c b/drivers/net/wireless/marvell/mwifiex/pcie_quirks.c
> index 4064f99b36ba..b5f214fc1212 100644
> --- a/drivers/net/wireless/marvell/mwifiex/pcie_quirks.c
> +++ b/drivers/net/wireless/marvell/mwifiex/pcie_quirks.c
> @@ -15,6 +15,72 @@
>  
>  /* quirk table based on DMI matching */
>  static const struct dmi_system_id mwifiex_quirk_table[] = {
> +	{
> +		.ident = "Surface Pro 4",
> +		.matches = {
> +			DMI_EXACT_MATCH(DMI_SYS_VENDOR, "Microsoft Corporation"),
> +			DMI_EXACT_MATCH(DMI_PRODUCT_NAME, "Surface Pro 4"),
> +		},
> +		.driver_data = (void *)QUIRK_FW_RST_D3COLD,
> +	},
> +	{
> +		.ident = "Surface Pro 5",
> +		.matches = {
> +			/* match for SKU here due to generic product name "Surface Pro" */
> +			DMI_EXACT_MATCH(DMI_SYS_VENDOR, "Microsoft Corporation"),
> +			DMI_EXACT_MATCH(DMI_PRODUCT_SKU, "Surface_Pro_1796"),
> +		},
> +		.driver_data = (void *)QUIRK_FW_RST_D3COLD,
> +	},
> +	{
> +		.ident = "Surface Pro 5 (LTE)",
> +		.matches = {
> +			/* match for SKU here due to generic product name "Surface Pro" */
> +			DMI_EXACT_MATCH(DMI_SYS_VENDOR, "Microsoft Corporation"),
> +			DMI_EXACT_MATCH(DMI_PRODUCT_SKU, "Surface_Pro_1807"),
> +		},
> +		.driver_data = (void *)QUIRK_FW_RST_D3COLD,
> +	},
> +	{
> +		.ident = "Surface Pro 6",
> +		.matches = {
> +			DMI_EXACT_MATCH(DMI_SYS_VENDOR, "Microsoft Corporation"),
> +			DMI_EXACT_MATCH(DMI_PRODUCT_NAME, "Surface Pro 6"),
> +		},
> +		.driver_data = (void *)QUIRK_FW_RST_D3COLD,
> +	},
> +	{
> +		.ident = "Surface Book 1",
> +		.matches = {
> +			DMI_EXACT_MATCH(DMI_SYS_VENDOR, "Microsoft Corporation"),
> +			DMI_EXACT_MATCH(DMI_PRODUCT_NAME, "Surface Book"),
> +		},
> +		.driver_data = (void *)QUIRK_FW_RST_D3COLD,
> +	},
> +	{
> +		.ident = "Surface Book 2",
> +		.matches = {
> +			DMI_EXACT_MATCH(DMI_SYS_VENDOR, "Microsoft Corporation"),
> +			DMI_EXACT_MATCH(DMI_PRODUCT_NAME, "Surface Book 2"),
> +		},
> +		.driver_data = (void *)QUIRK_FW_RST_D3COLD,
> +	},
> +	{
> +		.ident = "Surface Laptop 1",
> +		.matches = {
> +			DMI_EXACT_MATCH(DMI_SYS_VENDOR, "Microsoft Corporation"),
> +			DMI_EXACT_MATCH(DMI_PRODUCT_NAME, "Surface Laptop"),
> +		},
> +		.driver_data = (void *)QUIRK_FW_RST_D3COLD,
> +	},
> +	{
> +		.ident = "Surface Laptop 2",
> +		.matches = {
> +			DMI_EXACT_MATCH(DMI_SYS_VENDOR, "Microsoft Corporation"),
> +			DMI_EXACT_MATCH(DMI_PRODUCT_NAME, "Surface Laptop 2"),
> +		},
> +		.driver_data = (void *)QUIRK_FW_RST_D3COLD,
> +	},
>  	{}
>  };
>  
> @@ -29,4 +95,61 @@ void mwifiex_initialize_quirks(struct pcie_service_card *card)
>  
>  	if (!card->quirks)
>  		dev_info(&pdev->dev, "no quirks enabled\n");
> +	if (card->quirks & QUIRK_FW_RST_D3COLD)
> +		dev_info(&pdev->dev, "quirk reset_d3cold enabled\n");
> +}
> +
> +static void mwifiex_pcie_set_power_d3cold(struct pci_dev *pdev)
> +{
> +	dev_info(&pdev->dev, "putting into D3cold...\n");
> +
> +	pci_save_state(pdev);
> +	if (pci_is_enabled(pdev))
> +		pci_disable_device(pdev);
> +	pci_set_power_state(pdev, PCI_D3cold);
> +}
> +
> +static int mwifiex_pcie_set_power_d0(struct pci_dev *pdev)
> +{
> +	int ret;
> +
> +	dev_info(&pdev->dev, "putting into D0...\n");
> +
> +	pci_set_power_state(pdev, PCI_D0);
> +	ret = pci_enable_device(pdev);
> +	if (ret) {
> +		dev_err(&pdev->dev, "pci_enable_device failed\n");
> +		return ret;
> +	}
> +	pci_restore_state(pdev);
> +
> +	return 0;
> +}
> +
> +int mwifiex_pcie_reset_d3cold_quirk(struct pci_dev *pdev)
> +{
> +	struct pci_dev *parent_pdev = pci_upstream_bridge(pdev);
> +	int ret;
> +
> +	/* Power-cycle (put into D3cold then D0) */
> +	dev_info(&pdev->dev, "Using reset_d3cold quirk to perform FW reset\n");
> +
> +	/* We need to perform power-cycle also for bridge of wifi because
> +	 * on some devices (e.g. Surface Book 1), the OS for some reasons
> +	 * can't know the real power state of the bridge.
> +	 * When tried to power-cycle only wifi, the reset failed with the
> +	 * following dmesg log:
> +	 * "Cannot transition to power state D0 for parent in D3hot".
> +	 */
> +	mwifiex_pcie_set_power_d3cold(pdev);
> +	mwifiex_pcie_set_power_d3cold(parent_pdev);
> +
> +	ret = mwifiex_pcie_set_power_d0(parent_pdev);
> +	if (ret)
> +		return ret;
> +	ret = mwifiex_pcie_set_power_d0(pdev);
> +	if (ret)
> +		return ret;
> +
> +	return 0;
>  }
> diff --git a/drivers/net/wireless/marvell/mwifiex/pcie_quirks.h b/drivers/net/wireless/marvell/mwifiex/pcie_quirks.h
> index 7a1fe3b3a61a..549093067813 100644
> --- a/drivers/net/wireless/marvell/mwifiex/pcie_quirks.h
> +++ b/drivers/net/wireless/marvell/mwifiex/pcie_quirks.h
> @@ -5,4 +5,7 @@
>  
>  #include "pcie.h"
>  
> +#define QUIRK_FW_RST_D3COLD	BIT(0)
> +
>  void mwifiex_initialize_quirks(struct pcie_service_card *card);
> +int mwifiex_pcie_reset_d3cold_quirk(struct pci_dev *pdev);
> -- 
> 2.31.1
>
Jonas Dreßler July 9, 2021, 3:33 p.m. UTC | #2
On 7/9/21 5:18 PM, Pali Rohár wrote:
> On Friday 09 July 2021 16:58:31 Jonas Dreßler wrote:
>> From: Tsuchiya Yuto <kitakar@gmail.com>
>>
>> To reset mwifiex on Surface gen4+ (Pro 4 or later gen) devices, it
>> seems that putting the wifi device into D3cold is required according
>> to errata.inf file on Windows installation (Windows/INF/errata.inf).
>>
>> This patch adds a function that performs power-cycle (put into D3cold
>> then D0) and call the function at the end of reset_prepare().
>>
>> Note: Need to also reset the parent device (bridge) of wifi on SB1;
>> it might be because the bridge of wifi always reports it's in D3hot.
>> When I tried to reset only the wifi device (not touching parent), it gave
>> the following error and the reset failed:
>>
>>      acpi device:4b: Cannot transition to power state D0 for parent in D3hot
>>      mwifiex_pcie 0000:03:00.0: can't change power state from D3cold to D0 (config space inaccessible)
>>
>> Signed-off-by: Tsuchiya Yuto <kitakar@gmail.com>
>> Signed-off-by: Jonas Dreßler <verdre@v0yd.nl>
>> ---
>>   drivers/net/wireless/marvell/mwifiex/pcie.c   |   7 +
>>   .../wireless/marvell/mwifiex/pcie_quirks.c    | 123 ++++++++++++++++++
>>   .../wireless/marvell/mwifiex/pcie_quirks.h    |   3 +
>>   3 files changed, 133 insertions(+)
>>
>> diff --git a/drivers/net/wireless/marvell/mwifiex/pcie.c b/drivers/net/wireless/marvell/mwifiex/pcie.c
>> index a530832c9421..c6ccce426b49 100644
>> --- a/drivers/net/wireless/marvell/mwifiex/pcie.c
>> +++ b/drivers/net/wireless/marvell/mwifiex/pcie.c
>> @@ -528,6 +528,13 @@ static void mwifiex_pcie_reset_prepare(struct pci_dev *pdev)
>>   	mwifiex_shutdown_sw(adapter);
>>   	clear_bit(MWIFIEX_IFACE_WORK_DEVICE_DUMP, &card->work_flags);
>>   	clear_bit(MWIFIEX_IFACE_WORK_CARD_RESET, &card->work_flags);
>> +
>> +	/* On MS Surface gen4+ devices FLR isn't effective to recover from
>> +	 * hangups, so we power-cycle the card instead.
>> +	 */
>> +	if (card->quirks & QUIRK_FW_RST_D3COLD)
>> +		mwifiex_pcie_reset_d3cold_quirk(pdev);
>> +
> 
> Hello! Now I'm thinking loudly about this patch. Why this kind of reset
> is needed only for Surface devices? AFAIK these 88W8897 chips are same
> in all cards. Chip itself implements PCIe interface (and also SDIO) so
> for me looks very strange if this 88W8897 PCIe device needs DMI specific
> quirks. I cannot believe that Microsoft got some special version of
> these chips from Marvell which are different than version uses on cards
> in mPCIe form factor.
> 
> And now when I'm reading comment below about PCIe bridge to which is
> this 88W8897 PCIe chip connected, is not this rather an issue in that
> PCIe bridge (instead of mwifiex/88W8897) or in ACPI firmware which
> controls this bridge?
> 
> Or are having other people same issues on mPCIe form factor wifi cards
> with 88W8897 chips and then this quirk should not DMI dependent?
> 
> Note that I'm seeing issues with reset and other things also on chip
> 88W8997 when is connected to system via SDIO. These chips have both PCIe
> and SDIO buses, it just depends which pins are used.
> 

Hi and thanks for the quick reply! Honestly I've no idea, this is just 
the first method we found that allows for a proper reset of the chip. 
What I know is that some Surface devices need that ACPI DSM call (the 
one that was done in the commit I dropped in this version of the 
patchset) to reset the chip instead of this method.

Afaik other devices with this chip don't need this resetting method, at 
least Marvell employees couldn't reproduce the issues on their testing 
devices.

So would you suggest we just try to match for the pci chip 88W8897 
instead? Then we'd probably have to check if there are any laptops where 
multiple devices are connected to the pci bridge as Amey suggested in a 
review before.

>>   	mwifiex_dbg(adapter, INFO, "%s, successful\n", __func__);
>>   
>>   	card->pci_reset_ongoing = true;
>> diff --git a/drivers/net/wireless/marvell/mwifiex/pcie_quirks.c b/drivers/net/wireless/marvell/mwifiex/pcie_quirks.c
>> index 4064f99b36ba..b5f214fc1212 100644
>> --- a/drivers/net/wireless/marvell/mwifiex/pcie_quirks.c
>> +++ b/drivers/net/wireless/marvell/mwifiex/pcie_quirks.c
>> @@ -15,6 +15,72 @@
>>   
>>   /* quirk table based on DMI matching */
>>   static const struct dmi_system_id mwifiex_quirk_table[] = {
>> +	{
>> +		.ident = "Surface Pro 4",
>> +		.matches = {
>> +			DMI_EXACT_MATCH(DMI_SYS_VENDOR, "Microsoft Corporation"),
>> +			DMI_EXACT_MATCH(DMI_PRODUCT_NAME, "Surface Pro 4"),
>> +		},
>> +		.driver_data = (void *)QUIRK_FW_RST_D3COLD,
>> +	},
>> +	{
>> +		.ident = "Surface Pro 5",
>> +		.matches = {
>> +			/* match for SKU here due to generic product name "Surface Pro" */
>> +			DMI_EXACT_MATCH(DMI_SYS_VENDOR, "Microsoft Corporation"),
>> +			DMI_EXACT_MATCH(DMI_PRODUCT_SKU, "Surface_Pro_1796"),
>> +		},
>> +		.driver_data = (void *)QUIRK_FW_RST_D3COLD,
>> +	},
>> +	{
>> +		.ident = "Surface Pro 5 (LTE)",
>> +		.matches = {
>> +			/* match for SKU here due to generic product name "Surface Pro" */
>> +			DMI_EXACT_MATCH(DMI_SYS_VENDOR, "Microsoft Corporation"),
>> +			DMI_EXACT_MATCH(DMI_PRODUCT_SKU, "Surface_Pro_1807"),
>> +		},
>> +		.driver_data = (void *)QUIRK_FW_RST_D3COLD,
>> +	},
>> +	{
>> +		.ident = "Surface Pro 6",
>> +		.matches = {
>> +			DMI_EXACT_MATCH(DMI_SYS_VENDOR, "Microsoft Corporation"),
>> +			DMI_EXACT_MATCH(DMI_PRODUCT_NAME, "Surface Pro 6"),
>> +		},
>> +		.driver_data = (void *)QUIRK_FW_RST_D3COLD,
>> +	},
>> +	{
>> +		.ident = "Surface Book 1",
>> +		.matches = {
>> +			DMI_EXACT_MATCH(DMI_SYS_VENDOR, "Microsoft Corporation"),
>> +			DMI_EXACT_MATCH(DMI_PRODUCT_NAME, "Surface Book"),
>> +		},
>> +		.driver_data = (void *)QUIRK_FW_RST_D3COLD,
>> +	},
>> +	{
>> +		.ident = "Surface Book 2",
>> +		.matches = {
>> +			DMI_EXACT_MATCH(DMI_SYS_VENDOR, "Microsoft Corporation"),
>> +			DMI_EXACT_MATCH(DMI_PRODUCT_NAME, "Surface Book 2"),
>> +		},
>> +		.driver_data = (void *)QUIRK_FW_RST_D3COLD,
>> +	},
>> +	{
>> +		.ident = "Surface Laptop 1",
>> +		.matches = {
>> +			DMI_EXACT_MATCH(DMI_SYS_VENDOR, "Microsoft Corporation"),
>> +			DMI_EXACT_MATCH(DMI_PRODUCT_NAME, "Surface Laptop"),
>> +		},
>> +		.driver_data = (void *)QUIRK_FW_RST_D3COLD,
>> +	},
>> +	{
>> +		.ident = "Surface Laptop 2",
>> +		.matches = {
>> +			DMI_EXACT_MATCH(DMI_SYS_VENDOR, "Microsoft Corporation"),
>> +			DMI_EXACT_MATCH(DMI_PRODUCT_NAME, "Surface Laptop 2"),
>> +		},
>> +		.driver_data = (void *)QUIRK_FW_RST_D3COLD,
>> +	},
>>   	{}
>>   };
>>   
>> @@ -29,4 +95,61 @@ void mwifiex_initialize_quirks(struct pcie_service_card *card)
>>   
>>   	if (!card->quirks)
>>   		dev_info(&pdev->dev, "no quirks enabled\n");
>> +	if (card->quirks & QUIRK_FW_RST_D3COLD)
>> +		dev_info(&pdev->dev, "quirk reset_d3cold enabled\n");
>> +}
>> +
>> +static void mwifiex_pcie_set_power_d3cold(struct pci_dev *pdev)
>> +{
>> +	dev_info(&pdev->dev, "putting into D3cold...\n");
>> +
>> +	pci_save_state(pdev);
>> +	if (pci_is_enabled(pdev))
>> +		pci_disable_device(pdev);
>> +	pci_set_power_state(pdev, PCI_D3cold);
>> +}
>> +
>> +static int mwifiex_pcie_set_power_d0(struct pci_dev *pdev)
>> +{
>> +	int ret;
>> +
>> +	dev_info(&pdev->dev, "putting into D0...\n");
>> +
>> +	pci_set_power_state(pdev, PCI_D0);
>> +	ret = pci_enable_device(pdev);
>> +	if (ret) {
>> +		dev_err(&pdev->dev, "pci_enable_device failed\n");
>> +		return ret;
>> +	}
>> +	pci_restore_state(pdev);
>> +
>> +	return 0;
>> +}
>> +
>> +int mwifiex_pcie_reset_d3cold_quirk(struct pci_dev *pdev)
>> +{
>> +	struct pci_dev *parent_pdev = pci_upstream_bridge(pdev);
>> +	int ret;
>> +
>> +	/* Power-cycle (put into D3cold then D0) */
>> +	dev_info(&pdev->dev, "Using reset_d3cold quirk to perform FW reset\n");
>> +
>> +	/* We need to perform power-cycle also for bridge of wifi because
>> +	 * on some devices (e.g. Surface Book 1), the OS for some reasons
>> +	 * can't know the real power state of the bridge.
>> +	 * When tried to power-cycle only wifi, the reset failed with the
>> +	 * following dmesg log:
>> +	 * "Cannot transition to power state D0 for parent in D3hot".
>> +	 */
>> +	mwifiex_pcie_set_power_d3cold(pdev);
>> +	mwifiex_pcie_set_power_d3cold(parent_pdev);
>> +
>> +	ret = mwifiex_pcie_set_power_d0(parent_pdev);
>> +	if (ret)
>> +		return ret;
>> +	ret = mwifiex_pcie_set_power_d0(pdev);
>> +	if (ret)
>> +		return ret;
>> +
>> +	return 0;
>>   }
>> diff --git a/drivers/net/wireless/marvell/mwifiex/pcie_quirks.h b/drivers/net/wireless/marvell/mwifiex/pcie_quirks.h
>> index 7a1fe3b3a61a..549093067813 100644
>> --- a/drivers/net/wireless/marvell/mwifiex/pcie_quirks.h
>> +++ b/drivers/net/wireless/marvell/mwifiex/pcie_quirks.h
>> @@ -5,4 +5,7 @@
>>   
>>   #include "pcie.h"
>>   
>> +#define QUIRK_FW_RST_D3COLD	BIT(0)
>> +
>>   void mwifiex_initialize_quirks(struct pcie_service_card *card);
>> +int mwifiex_pcie_reset_d3cold_quirk(struct pci_dev *pdev);
>> -- 
>> 2.31.1
>>
Andy Shevchenko July 9, 2021, 4:01 p.m. UTC | #3
On Fri, Jul 9, 2021 at 6:18 PM Pali Rohár <pali@kernel.org> wrote:
> On Friday 09 July 2021 16:58:31 Jonas Dreßler wrote:


> Hello! Now I'm thinking loudly about this patch. Why this kind of reset
> is needed only for Surface devices? AFAIK these 88W8897 chips are same
> in all cards. Chip itself implements PCIe interface (and also SDIO) so
> for me looks very strange if this 88W8897 PCIe device needs DMI specific
> quirks. I cannot believe that Microsoft got some special version of
> these chips from Marvell which are different than version uses on cards
> in mPCIe form factor.
>
> And now when I'm reading comment below about PCIe bridge to which is
> this 88W8897 PCIe chip connected, is not this rather an issue in that
> PCIe bridge (instead of mwifiex/88W8897) or in ACPI firmware which
> controls this bridge?
>
> Or are having other people same issues on mPCIe form factor wifi cards
> with 88W8897 chips and then this quirk should not DMI dependent?
>
> Note that I'm seeing issues with reset and other things also on chip
> 88W8997 when is connected to system via SDIO. These chips have both PCIe
> and SDIO buses, it just depends which pins are used.

I'm replying loudly :-)

You know that depending on the interface the firmware even for the
same chip may be way different. And if you have had any experience
working in product companies you should know well that bug in product
X is not gonna be fixed if it was not reported, but gets fixed on
product Y due to that. Besides that, how do you know that MS has not
been given the special edition of the FW?

As icing on the cake, the Marvell has been bought and I believe they
abandoned their products quite a while ago. You may read kernel
bugzilla for the details (last Marvell developer who answered to the
reports seems has no clue about the driver).

All that said, I believe that we have to follow whatever fixes we
would have and be thankful to contributors and testers.

For the record, I've been suffering from the Linux driver of this
hardware for a while. And I'm fully in support of any quirks that will
help UX.
Pali Rohár July 9, 2021, 4:12 p.m. UTC | #4
On Friday 09 July 2021 17:33:34 Jonas Dreßler wrote:
> On 7/9/21 5:18 PM, Pali Rohár wrote:
> > On Friday 09 July 2021 16:58:31 Jonas Dreßler wrote:
> > > From: Tsuchiya Yuto <kitakar@gmail.com>
> > > 
> > > To reset mwifiex on Surface gen4+ (Pro 4 or later gen) devices, it
> > > seems that putting the wifi device into D3cold is required according
> > > to errata.inf file on Windows installation (Windows/INF/errata.inf).
> > > 
> > > This patch adds a function that performs power-cycle (put into D3cold
> > > then D0) and call the function at the end of reset_prepare().
> > > 
> > > Note: Need to also reset the parent device (bridge) of wifi on SB1;
> > > it might be because the bridge of wifi always reports it's in D3hot.
> > > When I tried to reset only the wifi device (not touching parent), it gave
> > > the following error and the reset failed:
> > > 
> > >      acpi device:4b: Cannot transition to power state D0 for parent in D3hot
> > >      mwifiex_pcie 0000:03:00.0: can't change power state from D3cold to D0 (config space inaccessible)
> > > 
> > > Signed-off-by: Tsuchiya Yuto <kitakar@gmail.com>
> > > Signed-off-by: Jonas Dreßler <verdre@v0yd.nl>
> > > ---
> > >   drivers/net/wireless/marvell/mwifiex/pcie.c   |   7 +
> > >   .../wireless/marvell/mwifiex/pcie_quirks.c    | 123 ++++++++++++++++++
> > >   .../wireless/marvell/mwifiex/pcie_quirks.h    |   3 +
> > >   3 files changed, 133 insertions(+)
> > > 
> > > diff --git a/drivers/net/wireless/marvell/mwifiex/pcie.c b/drivers/net/wireless/marvell/mwifiex/pcie.c
> > > index a530832c9421..c6ccce426b49 100644
> > > --- a/drivers/net/wireless/marvell/mwifiex/pcie.c
> > > +++ b/drivers/net/wireless/marvell/mwifiex/pcie.c
> > > @@ -528,6 +528,13 @@ static void mwifiex_pcie_reset_prepare(struct pci_dev *pdev)
> > >   	mwifiex_shutdown_sw(adapter);
> > >   	clear_bit(MWIFIEX_IFACE_WORK_DEVICE_DUMP, &card->work_flags);
> > >   	clear_bit(MWIFIEX_IFACE_WORK_CARD_RESET, &card->work_flags);
> > > +
> > > +	/* On MS Surface gen4+ devices FLR isn't effective to recover from
> > > +	 * hangups, so we power-cycle the card instead.
> > > +	 */
> > > +	if (card->quirks & QUIRK_FW_RST_D3COLD)
> > > +		mwifiex_pcie_reset_d3cold_quirk(pdev);
> > > +
> > 
> > Hello! Now I'm thinking loudly about this patch. Why this kind of reset
> > is needed only for Surface devices? AFAIK these 88W8897 chips are same
> > in all cards. Chip itself implements PCIe interface (and also SDIO) so
> > for me looks very strange if this 88W8897 PCIe device needs DMI specific
> > quirks. I cannot believe that Microsoft got some special version of
> > these chips from Marvell which are different than version uses on cards
> > in mPCIe form factor.
> > 
> > And now when I'm reading comment below about PCIe bridge to which is
> > this 88W8897 PCIe chip connected, is not this rather an issue in that
> > PCIe bridge (instead of mwifiex/88W8897) or in ACPI firmware which
> > controls this bridge?
> > 
> > Or are having other people same issues on mPCIe form factor wifi cards
> > with 88W8897 chips and then this quirk should not DMI dependent?
> > 
> > Note that I'm seeing issues with reset and other things also on chip
> > 88W8997 when is connected to system via SDIO. These chips have both PCIe
> > and SDIO buses, it just depends which pins are used.
> > 
> 
> Hi and thanks for the quick reply! Honestly I've no idea, this is just the
> first method we found that allows for a proper reset of the chip. What I
> know is that some Surface devices need that ACPI DSM call (the one that was
> done in the commit I dropped in this version of the patchset) to reset the
> chip instead of this method.
> 
> Afaik other devices with this chip don't need this resetting method, at
> least Marvell employees couldn't reproduce the issues on their testing
> devices.
> 
> So would you suggest we just try to match for the pci chip 88W8897 instead?

Hello! Such suggestion makes sense when we know that it is 88W8897
issue. But if you got information that issue cannot be reproduced on
other 88W8897 cards then matching 88W8897 is not correct.

From all this information looks like that it is problem in (Microsoft?)
PCIe bridge to which is card connected. Otherwise I do not reason how it
can be 88W8897 affected. Either it is reproducible on 88W8897 cards also
in other devices or issue is not on 88W8897 card.

> Then we'd probably have to check if there are any laptops where multiple
> devices are connected to the pci bridge as Amey suggested in a review
> before.

Well, I do not know... But if this is issue with PCIe bridge then
similar issue could be observed also for other PCIe devices with this
PCIe bridge. But question is if there are other laptops with this PCIe
bridge. And also it can be a problem in ACPI firmware on those Surface
devices, which implements some PCIe bridge functionality. So it is
possible that issue is with PCIe bridge, not in HW, but in SW/firmware
part which can be Microsoft specific... So too many questions to which
we do not know answers.

Could you provide output of 'lspci -nn -vv' and 'lspci -tvnn' on
affected machines? If you have already sent it in some previous email,
just send a link. At least I'm not able to find it right now and output
may contain something useful...

> > >   	mwifiex_dbg(adapter, INFO, "%s, successful\n", __func__);
> > >   	card->pci_reset_ongoing = true;
> > > diff --git a/drivers/net/wireless/marvell/mwifiex/pcie_quirks.c b/drivers/net/wireless/marvell/mwifiex/pcie_quirks.c
> > > index 4064f99b36ba..b5f214fc1212 100644
> > > --- a/drivers/net/wireless/marvell/mwifiex/pcie_quirks.c
> > > +++ b/drivers/net/wireless/marvell/mwifiex/pcie_quirks.c
> > > @@ -15,6 +15,72 @@
> > >   /* quirk table based on DMI matching */
> > >   static const struct dmi_system_id mwifiex_quirk_table[] = {
> > > +	{
> > > +		.ident = "Surface Pro 4",
> > > +		.matches = {
> > > +			DMI_EXACT_MATCH(DMI_SYS_VENDOR, "Microsoft Corporation"),
> > > +			DMI_EXACT_MATCH(DMI_PRODUCT_NAME, "Surface Pro 4"),
> > > +		},
> > > +		.driver_data = (void *)QUIRK_FW_RST_D3COLD,
> > > +	},
> > > +	{
> > > +		.ident = "Surface Pro 5",
> > > +		.matches = {
> > > +			/* match for SKU here due to generic product name "Surface Pro" */
> > > +			DMI_EXACT_MATCH(DMI_SYS_VENDOR, "Microsoft Corporation"),
> > > +			DMI_EXACT_MATCH(DMI_PRODUCT_SKU, "Surface_Pro_1796"),
> > > +		},
> > > +		.driver_data = (void *)QUIRK_FW_RST_D3COLD,
> > > +	},
> > > +	{
> > > +		.ident = "Surface Pro 5 (LTE)",
> > > +		.matches = {
> > > +			/* match for SKU here due to generic product name "Surface Pro" */
> > > +			DMI_EXACT_MATCH(DMI_SYS_VENDOR, "Microsoft Corporation"),
> > > +			DMI_EXACT_MATCH(DMI_PRODUCT_SKU, "Surface_Pro_1807"),
> > > +		},
> > > +		.driver_data = (void *)QUIRK_FW_RST_D3COLD,
> > > +	},
> > > +	{
> > > +		.ident = "Surface Pro 6",
> > > +		.matches = {
> > > +			DMI_EXACT_MATCH(DMI_SYS_VENDOR, "Microsoft Corporation"),
> > > +			DMI_EXACT_MATCH(DMI_PRODUCT_NAME, "Surface Pro 6"),
> > > +		},
> > > +		.driver_data = (void *)QUIRK_FW_RST_D3COLD,
> > > +	},
> > > +	{
> > > +		.ident = "Surface Book 1",
> > > +		.matches = {
> > > +			DMI_EXACT_MATCH(DMI_SYS_VENDOR, "Microsoft Corporation"),
> > > +			DMI_EXACT_MATCH(DMI_PRODUCT_NAME, "Surface Book"),
> > > +		},
> > > +		.driver_data = (void *)QUIRK_FW_RST_D3COLD,
> > > +	},
> > > +	{
> > > +		.ident = "Surface Book 2",
> > > +		.matches = {
> > > +			DMI_EXACT_MATCH(DMI_SYS_VENDOR, "Microsoft Corporation"),
> > > +			DMI_EXACT_MATCH(DMI_PRODUCT_NAME, "Surface Book 2"),
> > > +		},
> > > +		.driver_data = (void *)QUIRK_FW_RST_D3COLD,
> > > +	},
> > > +	{
> > > +		.ident = "Surface Laptop 1",
> > > +		.matches = {
> > > +			DMI_EXACT_MATCH(DMI_SYS_VENDOR, "Microsoft Corporation"),
> > > +			DMI_EXACT_MATCH(DMI_PRODUCT_NAME, "Surface Laptop"),
> > > +		},
> > > +		.driver_data = (void *)QUIRK_FW_RST_D3COLD,
> > > +	},
> > > +	{
> > > +		.ident = "Surface Laptop 2",
> > > +		.matches = {
> > > +			DMI_EXACT_MATCH(DMI_SYS_VENDOR, "Microsoft Corporation"),
> > > +			DMI_EXACT_MATCH(DMI_PRODUCT_NAME, "Surface Laptop 2"),
> > > +		},
> > > +		.driver_data = (void *)QUIRK_FW_RST_D3COLD,
> > > +	},
> > >   	{}
> > >   };
> > > @@ -29,4 +95,61 @@ void mwifiex_initialize_quirks(struct pcie_service_card *card)
> > >   	if (!card->quirks)
> > >   		dev_info(&pdev->dev, "no quirks enabled\n");
> > > +	if (card->quirks & QUIRK_FW_RST_D3COLD)
> > > +		dev_info(&pdev->dev, "quirk reset_d3cold enabled\n");
> > > +}
> > > +
> > > +static void mwifiex_pcie_set_power_d3cold(struct pci_dev *pdev)
> > > +{
> > > +	dev_info(&pdev->dev, "putting into D3cold...\n");
> > > +
> > > +	pci_save_state(pdev);
> > > +	if (pci_is_enabled(pdev))
> > > +		pci_disable_device(pdev);
> > > +	pci_set_power_state(pdev, PCI_D3cold);
> > > +}
> > > +
> > > +static int mwifiex_pcie_set_power_d0(struct pci_dev *pdev)
> > > +{
> > > +	int ret;
> > > +
> > > +	dev_info(&pdev->dev, "putting into D0...\n");
> > > +
> > > +	pci_set_power_state(pdev, PCI_D0);
> > > +	ret = pci_enable_device(pdev);
> > > +	if (ret) {
> > > +		dev_err(&pdev->dev, "pci_enable_device failed\n");
> > > +		return ret;
> > > +	}
> > > +	pci_restore_state(pdev);
> > > +
> > > +	return 0;
> > > +}
> > > +
> > > +int mwifiex_pcie_reset_d3cold_quirk(struct pci_dev *pdev)
> > > +{
> > > +	struct pci_dev *parent_pdev = pci_upstream_bridge(pdev);
> > > +	int ret;
> > > +
> > > +	/* Power-cycle (put into D3cold then D0) */
> > > +	dev_info(&pdev->dev, "Using reset_d3cold quirk to perform FW reset\n");
> > > +
> > > +	/* We need to perform power-cycle also for bridge of wifi because
> > > +	 * on some devices (e.g. Surface Book 1), the OS for some reasons
> > > +	 * can't know the real power state of the bridge.
> > > +	 * When tried to power-cycle only wifi, the reset failed with the
> > > +	 * following dmesg log:
> > > +	 * "Cannot transition to power state D0 for parent in D3hot".
> > > +	 */
> > > +	mwifiex_pcie_set_power_d3cold(pdev);
> > > +	mwifiex_pcie_set_power_d3cold(parent_pdev);
> > > +
> > > +	ret = mwifiex_pcie_set_power_d0(parent_pdev);
> > > +	if (ret)
> > > +		return ret;
> > > +	ret = mwifiex_pcie_set_power_d0(pdev);
> > > +	if (ret)
> > > +		return ret;
> > > +
> > > +	return 0;
> > >   }
> > > diff --git a/drivers/net/wireless/marvell/mwifiex/pcie_quirks.h b/drivers/net/wireless/marvell/mwifiex/pcie_quirks.h
> > > index 7a1fe3b3a61a..549093067813 100644
> > > --- a/drivers/net/wireless/marvell/mwifiex/pcie_quirks.h
> > > +++ b/drivers/net/wireless/marvell/mwifiex/pcie_quirks.h
> > > @@ -5,4 +5,7 @@
> > >   #include "pcie.h"
> > > +#define QUIRK_FW_RST_D3COLD	BIT(0)
> > > +
> > >   void mwifiex_initialize_quirks(struct pcie_service_card *card);
> > > +int mwifiex_pcie_reset_d3cold_quirk(struct pci_dev *pdev);
> > > -- 
> > > 2.31.1
> > > 
>
Pali Rohár July 9, 2021, 4:31 p.m. UTC | #5
On Friday 09 July 2021 19:01:44 Andy Shevchenko wrote:
> On Fri, Jul 9, 2021 at 6:18 PM Pali Rohár <pali@kernel.org> wrote:
> > On Friday 09 July 2021 16:58:31 Jonas Dreßler wrote:
> 
> 
> > Hello! Now I'm thinking loudly about this patch. Why this kind of reset
> > is needed only for Surface devices? AFAIK these 88W8897 chips are same
> > in all cards. Chip itself implements PCIe interface (and also SDIO) so
> > for me looks very strange if this 88W8897 PCIe device needs DMI specific
> > quirks. I cannot believe that Microsoft got some special version of
> > these chips from Marvell which are different than version uses on cards
> > in mPCIe form factor.
> >
> > And now when I'm reading comment below about PCIe bridge to which is
> > this 88W8897 PCIe chip connected, is not this rather an issue in that
> > PCIe bridge (instead of mwifiex/88W8897) or in ACPI firmware which
> > controls this bridge?
> >
> > Or are having other people same issues on mPCIe form factor wifi cards
> > with 88W8897 chips and then this quirk should not DMI dependent?
> >
> > Note that I'm seeing issues with reset and other things also on chip
> > 88W8997 when is connected to system via SDIO. These chips have both PCIe
> > and SDIO buses, it just depends which pins are used.
> 
> I'm replying loudly :-)
> 
> You know that depending on the interface the firmware even for the
> same chip may be way different. And if you have had any experience
> working in product companies you should know well that bug in product
> X is not gonna be fixed if it was not reported, but gets fixed on
> product Y due to that. Besides that, how do you know that MS has not
> been given the special edition of the FW?

Yes! But I know something about these chips/cards (I have also one
development kit) and it is quite different. It is possible that
Microsoft may have its special version, because I know that e.g. Google
got "fixed version" for some 88W8xxx chips (it is/was available on
internet). But firmware is loading by mwifiex driver and we
(linux-firmware) have just one version of firmware for these cards.
These 88W8xxx cards lost state and running firmware after reset/power so
after linux is booted, it loads "linux-firmware" version into 88W8897
card and then card not run "possibly MS special edition FW".

What can be possible is that we are dealing with ACPI firmware (which is
same for both Windows and Linux OS) and then it is related to PCIe
bridge where are some PCIe parts implemented...

> As icing on the cake, the Marvell has been bought and I believe they
> abandoned their products quite a while ago. You may read kernel
> bugzilla for the details (last Marvell developer who answered to the
> reports seems has no clue about the driver).

Marvell 88W[89]xxx wifi cards were sold to NXP together with developers.
Old @marvell addresses do not work so it is required to find new @nxp
addresses for developers.

There are recent firmware upgrades from NXP for linux-firmware, see:
https://lore.kernel.org/linux-firmware/DB7PR04MB453855B0D6C41923BCB0922EFC1C9@DB7PR04MB4538.eurprd04.prod.outlook.com/
(just only SDIO firmware for 88W8897, not PCIe)

But I know that response from NXP about these cards is slow... And more
people are complaining about firmware/driver issues for these cards.

> All that said, I believe that we have to follow whatever fixes we
> would have and be thankful to contributors and testers.

I agree. If this is really fixing these issues and we do not have better
fix yet, go ahead with it (if PCIe/WiFi maintainers are fine with it).

Once we have better fix, we can replace it.

Currently I'm just trying to understand this issue more deeply (e.g. how
it can relate with similar issue which I see on SDIO and if I cannot fix
also SDIO in better way) but seems that nobody knows more than "this
hack/quirk somehow works for PCIe".

> For the record, I've been suffering from the Linux driver of this
> hardware for a while. And I'm fully in support of any quirks that will
> help UX.
> 
> -- 
> With Best Regards,
> Andy Shevchenko
Maximilian Luz July 9, 2021, 5:03 p.m. UTC | #6
On 7/9/21 6:12 PM, Pali Rohár wrote:

[...]

>>> Hello! Now I'm thinking loudly about this patch. Why this kind of reset
>>> is needed only for Surface devices? AFAIK these 88W8897 chips are same
>>> in all cards. Chip itself implements PCIe interface (and also SDIO) so
>>> for me looks very strange if this 88W8897 PCIe device needs DMI specific
>>> quirks. I cannot believe that Microsoft got some special version of
>>> these chips from Marvell which are different than version uses on cards
>>> in mPCIe form factor.
>>>
>>> And now when I'm reading comment below about PCIe bridge to which is
>>> this 88W8897 PCIe chip connected, is not this rather an issue in that
>>> PCIe bridge (instead of mwifiex/88W8897) or in ACPI firmware which
>>> controls this bridge?
>>>
>>> Or are having other people same issues on mPCIe form factor wifi cards
>>> with 88W8897 chips and then this quirk should not DMI dependent?
>>>
>>> Note that I'm seeing issues with reset and other things also on chip
>>> 88W8997 when is connected to system via SDIO. These chips have both PCIe
>>> and SDIO buses, it just depends which pins are used.
>>>
>>
>> Hi and thanks for the quick reply! Honestly I've no idea, this is just the
>> first method we found that allows for a proper reset of the chip. What I
>> know is that some Surface devices need that ACPI DSM call (the one that was
>> done in the commit I dropped in this version of the patchset) to reset the
>> chip instead of this method.
>>
>> Afaik other devices with this chip don't need this resetting method, at
>> least Marvell employees couldn't reproduce the issues on their testing
>> devices.
>>
>> So would you suggest we just try to match for the pci chip 88W8897 instead?
> 
> Hello! Such suggestion makes sense when we know that it is 88W8897
> issue. But if you got information that issue cannot be reproduced on
> other 88W8897 cards then matching 88W8897 is not correct.
> 
>  From all this information looks like that it is problem in (Microsoft?)
> PCIe bridge to which is card connected. Otherwise I do not reason how it
> can be 88W8897 affected. Either it is reproducible on 88W8897 cards also
> in other devices or issue is not on 88W8897 card.

I doubt that it's an issue with the PCIe bridge (itself at least). The
same type of bridge is used for both dGPU and NVME SSD on my device (see
lspci output below) and those work fine. Also if I'm seeing that right
it's from the Intel CPU, so my guess is that a lot more people would
have issues with that then.

I don't know about the hardware side, so it might be possible that it's
an issue with integrating both bridge and wifi chip, in which case it's
still probably best handled via DMI quirks unless we know more.

Also as Tsuchiya mentioned in his original submission, on Windows the
device is reset via this D3cold method. I've only skimmed that
errata.inf file mentioned, but I think this is what he's referring to:

   Controls whether ACPIDeviceEnableD3ColdOnSurpriseRemoval rule will be
   evaluated or not on a given platform. Currently
   ACPIDeviceEnableD3ColdOnSurpriseRemoval rule only needs to be
   evaluated on Surface platforms which contain the Marvell WiFi
   controller which depends on device going through D3Cold as part of
   surprise-removal.

and

   Starting with Windows releases *after* Blue, ACPI will not put
   surprise-removed devices into D3Cold automatically. Some known
   scenarios (viz. WiFi reset/recovery) rely on the device cycling
   through D3Cold on surprise-removal. This hack allows surprise-removed
   devices to be put into D3Cold (if supported by the stack).

So, as far as I can tell, the chip doesn't like to be surprise-removed
(which seems to happen during reset) and then needs to be power-cycled,
which I think is likely due to some issue with firmware state.

So the quirk on Windows seems very Surface specific.

There also seem a bunch of revisions of these chips around, for example
my SB2 is affected by a bug that we've tied to the specific hardware
revision which causes some issues with host-sleep (IIRC chip switches
rapidly between wake and sleep states without any external influence,
which is not how it should behave and how it does behave on a later
hardware revision).

>> Then we'd probably have to check if there are any laptops where multiple
>> devices are connected to the pci bridge as Amey suggested in a review
>> before.
> 
> Well, I do not know... But if this is issue with PCIe bridge then
> similar issue could be observed also for other PCIe devices with this
> PCIe bridge. But question is if there are other laptops with this PCIe
> bridge. And also it can be a problem in ACPI firmware on those Surface
> devices, which implements some PCIe bridge functionality. So it is
> possible that issue is with PCIe bridge, not in HW, but in SW/firmware
> part which can be Microsoft specific... So too many questions to which
> we do not know answers.
> 
> Could you provide output of 'lspci -nn -vv' and 'lspci -tvnn' on
> affected machines? If you have already sent it in some previous email,
> just send a link. At least I'm not able to find it right now and output
> may contain something useful...

 From my Surface Book 2 (with the same issue):

  - lspci -tvnn: https://paste.ubuntu.com/p/mm3YpcZJ8N/
  - lspci -vv -nn: https://paste.ubuntu.com/p/dctTDP738N/

Regards,
Max
Pali Rohár July 9, 2021, 5:30 p.m. UTC | #7
On Friday 09 July 2021 19:03:37 Maximilian Luz wrote:
> On 7/9/21 6:12 PM, Pali Rohár wrote:
> 
> [...]
> 
> > > > Hello! Now I'm thinking loudly about this patch. Why this kind of reset
> > > > is needed only for Surface devices? AFAIK these 88W8897 chips are same
> > > > in all cards. Chip itself implements PCIe interface (and also SDIO) so
> > > > for me looks very strange if this 88W8897 PCIe device needs DMI specific
> > > > quirks. I cannot believe that Microsoft got some special version of
> > > > these chips from Marvell which are different than version uses on cards
> > > > in mPCIe form factor.
> > > > 
> > > > And now when I'm reading comment below about PCIe bridge to which is
> > > > this 88W8897 PCIe chip connected, is not this rather an issue in that
> > > > PCIe bridge (instead of mwifiex/88W8897) or in ACPI firmware which
> > > > controls this bridge?
> > > > 
> > > > Or are having other people same issues on mPCIe form factor wifi cards
> > > > with 88W8897 chips and then this quirk should not DMI dependent?
> > > > 
> > > > Note that I'm seeing issues with reset and other things also on chip
> > > > 88W8997 when is connected to system via SDIO. These chips have both PCIe
> > > > and SDIO buses, it just depends which pins are used.
> > > > 
> > > 
> > > Hi and thanks for the quick reply! Honestly I've no idea, this is just the
> > > first method we found that allows for a proper reset of the chip. What I
> > > know is that some Surface devices need that ACPI DSM call (the one that was
> > > done in the commit I dropped in this version of the patchset) to reset the
> > > chip instead of this method.
> > > 
> > > Afaik other devices with this chip don't need this resetting method, at
> > > least Marvell employees couldn't reproduce the issues on their testing
> > > devices.
> > > 
> > > So would you suggest we just try to match for the pci chip 88W8897 instead?
> > 
> > Hello! Such suggestion makes sense when we know that it is 88W8897
> > issue. But if you got information that issue cannot be reproduced on
> > other 88W8897 cards then matching 88W8897 is not correct.
> > 
> >  From all this information looks like that it is problem in (Microsoft?)
> > PCIe bridge to which is card connected. Otherwise I do not reason how it
> > can be 88W8897 affected. Either it is reproducible on 88W8897 cards also
> > in other devices or issue is not on 88W8897 card.
> 
> I doubt that it's an issue with the PCIe bridge (itself at least). The
> same type of bridge is used for both dGPU and NVME SSD on my device (see
> lspci output below) and those work fine. Also if I'm seeing that right
> it's from the Intel CPU, so my guess is that a lot more people would
> have issues with that then.

From information below it seems to be related to surprise removal.
Therefore is surprise removal working without issue for dGPU or NVME
SSD? Not all PCIe bridges support surprise removal...

> 
> I don't know about the hardware side, so it might be possible that it's
> an issue with integrating both bridge and wifi chip, in which case it's
> still probably best handled via DMI quirks unless we know more.
> 
> Also as Tsuchiya mentioned in his original submission, on Windows the
> device is reset via this D3cold method. I've only skimmed that
> errata.inf file mentioned, but I think this is what he's referring to:
> 
>   Controls whether ACPIDeviceEnableD3ColdOnSurpriseRemoval rule will be
>   evaluated or not on a given platform. Currently
>   ACPIDeviceEnableD3ColdOnSurpriseRemoval rule only needs to be
>   evaluated on Surface platforms which contain the Marvell WiFi
>   controller which depends on device going through D3Cold as part of
>   surprise-removal.
> 
> and
> 
>   Starting with Windows releases *after* Blue, ACPI will not put
>   surprise-removed devices into D3Cold automatically. Some known
>   scenarios (viz. WiFi reset/recovery) rely on the device cycling
>   through D3Cold on surprise-removal. This hack allows surprise-removed
>   devices to be put into D3Cold (if supported by the stack).
> 
> So, as far as I can tell, the chip doesn't like to be surprise-removed
> (which seems to happen during reset) and then needs to be power-cycled,
> which I think is likely due to some issue with firmware state.

Thanks for information. This really does not look like PCIe bridge
specific if bridge itself can handle surprise-removed devices. lspci can
tell us if bridge supports it or not (see below).

> So the quirk on Windows seems very Surface specific.
> 
> There also seem a bunch of revisions of these chips around, for example
> my SB2 is affected by a bug that we've tied to the specific hardware
> revision which causes some issues with host-sleep (IIRC chip switches
> rapidly between wake and sleep states without any external influence,
> which is not how it should behave and how it does behave on a later
> hardware revision).

Interesting... This looks like the issue can be in 88W8897 chip and
needs some special conditions to trigger? And Surface is triggering it
always?

> > > Then we'd probably have to check if there are any laptops where multiple
> > > devices are connected to the pci bridge as Amey suggested in a review
> > > before.
> > 
> > Well, I do not know... But if this is issue with PCIe bridge then
> > similar issue could be observed also for other PCIe devices with this
> > PCIe bridge. But question is if there are other laptops with this PCIe
> > bridge. And also it can be a problem in ACPI firmware on those Surface
> > devices, which implements some PCIe bridge functionality. So it is
> > possible that issue is with PCIe bridge, not in HW, but in SW/firmware
> > part which can be Microsoft specific... So too many questions to which
> > we do not know answers.
> > 
> > Could you provide output of 'lspci -nn -vv' and 'lspci -tvnn' on
> > affected machines? If you have already sent it in some previous email,
> > just send a link. At least I'm not able to find it right now and output
> > may contain something useful...
> 
> From my Surface Book 2 (with the same issue):
> 
>  - lspci -tvnn: https://paste.ubuntu.com/p/mm3YpcZJ8N/
>  - lspci -vv -nn: https://paste.ubuntu.com/p/dctTDP738N/

Could you re-run lspci under root account? There are missing important
parts like "Capabilities: <access denied>" where is information if
bridge supports surprise removal or not.

> Regards,
> Max
Maximilian Luz July 9, 2021, 6:16 p.m. UTC | #8
On 7/9/21 7:30 PM, Pali Rohár wrote:
> On Friday 09 July 2021 19:03:37 Maximilian Luz wrote:
>> On 7/9/21 6:12 PM, Pali Rohár wrote:
>>
>> [...]
>>
>>>>> Hello! Now I'm thinking loudly about this patch. Why this kind of reset
>>>>> is needed only for Surface devices? AFAIK these 88W8897 chips are same
>>>>> in all cards. Chip itself implements PCIe interface (and also SDIO) so
>>>>> for me looks very strange if this 88W8897 PCIe device needs DMI specific
>>>>> quirks. I cannot believe that Microsoft got some special version of
>>>>> these chips from Marvell which are different than version uses on cards
>>>>> in mPCIe form factor.
>>>>>
>>>>> And now when I'm reading comment below about PCIe bridge to which is
>>>>> this 88W8897 PCIe chip connected, is not this rather an issue in that
>>>>> PCIe bridge (instead of mwifiex/88W8897) or in ACPI firmware which
>>>>> controls this bridge?
>>>>>
>>>>> Or are having other people same issues on mPCIe form factor wifi cards
>>>>> with 88W8897 chips and then this quirk should not DMI dependent?
>>>>>
>>>>> Note that I'm seeing issues with reset and other things also on chip
>>>>> 88W8997 when is connected to system via SDIO. These chips have both PCIe
>>>>> and SDIO buses, it just depends which pins are used.
>>>>>
>>>>
>>>> Hi and thanks for the quick reply! Honestly I've no idea, this is just the
>>>> first method we found that allows for a proper reset of the chip. What I
>>>> know is that some Surface devices need that ACPI DSM call (the one that was
>>>> done in the commit I dropped in this version of the patchset) to reset the
>>>> chip instead of this method.
>>>>
>>>> Afaik other devices with this chip don't need this resetting method, at
>>>> least Marvell employees couldn't reproduce the issues on their testing
>>>> devices.
>>>>
>>>> So would you suggest we just try to match for the pci chip 88W8897 instead?
>>>
>>> Hello! Such suggestion makes sense when we know that it is 88W8897
>>> issue. But if you got information that issue cannot be reproduced on
>>> other 88W8897 cards then matching 88W8897 is not correct.
>>>
>>>   From all this information looks like that it is problem in (Microsoft?)
>>> PCIe bridge to which is card connected. Otherwise I do not reason how it
>>> can be 88W8897 affected. Either it is reproducible on 88W8897 cards also
>>> in other devices or issue is not on 88W8897 card.
>>
>> I doubt that it's an issue with the PCIe bridge (itself at least). The
>> same type of bridge is used for both dGPU and NVME SSD on my device (see
>> lspci output below) and those work fine. Also if I'm seeing that right
>> it's from the Intel CPU, so my guess is that a lot more people would
>> have issues with that then.
> 
>  From information below it seems to be related to surprise removal.
> Therefore is surprise removal working without issue for dGPU or NVME
> SSD? Not all PCIe bridges support surprise removal...

The dGPU on the Surface Book 2 is detachable (the whole base where that
is placed can be removed). As far as I can tell surprise removal works
perfectly fine for that one. The only thing that it needs is a driver for
out-of-band hot-plug signalling if the device is in D3cold while removed
as hotplug/removal notifications via PCI don't work in D3cold (this
works via ACPI, there is as far as I can tell no such mechanism for
WiFi, probably since it's not intended to be hot-unplugged).

>> I don't know about the hardware side, so it might be possible that it's
>> an issue with integrating both bridge and wifi chip, in which case it's
>> still probably best handled via DMI quirks unless we know more.
>>
>> Also as Tsuchiya mentioned in his original submission, on Windows the
>> device is reset via this D3cold method. I've only skimmed that
>> errata.inf file mentioned, but I think this is what he's referring to:
>>
>>    Controls whether ACPIDeviceEnableD3ColdOnSurpriseRemoval rule will be
>>    evaluated or not on a given platform. Currently
>>    ACPIDeviceEnableD3ColdOnSurpriseRemoval rule only needs to be
>>    evaluated on Surface platforms which contain the Marvell WiFi
>>    controller which depends on device going through D3Cold as part of
>>    surprise-removal.
>>
>> and
>>
>>    Starting with Windows releases *after* Blue, ACPI will not put
>>    surprise-removed devices into D3Cold automatically. Some known
>>    scenarios (viz. WiFi reset/recovery) rely on the device cycling
>>    through D3Cold on surprise-removal. This hack allows surprise-removed
>>    devices to be put into D3Cold (if supported by the stack).
>>
>> So, as far as I can tell, the chip doesn't like to be surprise-removed
>> (which seems to happen during reset) and then needs to be power-cycled,
>> which I think is likely due to some issue with firmware state.
> 
> Thanks for information. This really does not look like PCIe bridge
> specific if bridge itself can handle surprise-removed devices. lspci can
> tell us if bridge supports it or not (see below).
> 
>> So the quirk on Windows seems very Surface specific.
>>
>> There also seem a bunch of revisions of these chips around, for example
>> my SB2 is affected by a bug that we've tied to the specific hardware
>> revision which causes some issues with host-sleep (IIRC chip switches
>> rapidly between wake and sleep states without any external influence,
>> which is not how it should behave and how it does behave on a later
>> hardware revision).
> 
> Interesting... This looks like the issue can be in 88W8897 chip and
> needs some special conditions to trigger? And Surface is triggering it
> always?

Not always. It's been a while since I've been actively looking at this
and I'm not sure we ever had a good way to reproduce this. Also, I've
never really dealt with it as in-depth as Tsuchiya and Jonas have.

My (very) quick attempt ('echo 1 > /sys/bus/pci/.../reset) at
reproducing this didn't work, so I think at very least a network
connection needs to be active. Unfortunately I can't test that with a
network connection (and without compiling a custom kernel for which I
don't have the time right now) because there's currently another bug
deadlocking on device removal if there's an active connection during
removal (which also seems to trigger on reset). That one ill be fixed
by

   https://lore.kernel.org/linux-wireless/20210515024227.2159311-1-briannorris@chromium.org/

Jonas might know more.

>>>> Then we'd probably have to check if there are any laptops where multiple
>>>> devices are connected to the pci bridge as Amey suggested in a review
>>>> before.
>>>
>>> Well, I do not know... But if this is issue with PCIe bridge then
>>> similar issue could be observed also for other PCIe devices with this
>>> PCIe bridge. But question is if there are other laptops with this PCIe
>>> bridge. And also it can be a problem in ACPI firmware on those Surface
>>> devices, which implements some PCIe bridge functionality. So it is
>>> possible that issue is with PCIe bridge, not in HW, but in SW/firmware
>>> part which can be Microsoft specific... So too many questions to which
>>> we do not know answers.
>>>
>>> Could you provide output of 'lspci -nn -vv' and 'lspci -tvnn' on
>>> affected machines? If you have already sent it in some previous email,
>>> just send a link. At least I'm not able to find it right now and output
>>> may contain something useful...
>>
>>  From my Surface Book 2 (with the same issue):
>>
>>   - lspci -tvnn: https://paste.ubuntu.com/p/mm3YpcZJ8N/
>>   - lspci -vv -nn: https://paste.ubuntu.com/p/dctTDP738N/
> 
> Could you re-run lspci under root account? There are missing important
> parts like "Capabilities: <access denied>" where is information if
> bridge supports surprise removal or not.

Ah sorry, sure thing. Here's the updated lspci -nn -vv log:

   https://paste.ubuntu.com/p/fzsmCvm86Y/

The log for lspci -tvnn is the same.

>> Regards,
>> Max
Pali Rohár July 9, 2021, 6:44 p.m. UTC | #9
On Friday 09 July 2021 20:16:49 Maximilian Luz wrote:
> On 7/9/21 7:30 PM, Pali Rohár wrote:
> > On Friday 09 July 2021 19:03:37 Maximilian Luz wrote:
> > > On 7/9/21 6:12 PM, Pali Rohár wrote:
> > > 
> > > [...]
> > > 
> > > > > > Hello! Now I'm thinking loudly about this patch. Why this kind of reset
> > > > > > is needed only for Surface devices? AFAIK these 88W8897 chips are same
> > > > > > in all cards. Chip itself implements PCIe interface (and also SDIO) so
> > > > > > for me looks very strange if this 88W8897 PCIe device needs DMI specific
> > > > > > quirks. I cannot believe that Microsoft got some special version of
> > > > > > these chips from Marvell which are different than version uses on cards
> > > > > > in mPCIe form factor.
> > > > > > 
> > > > > > And now when I'm reading comment below about PCIe bridge to which is
> > > > > > this 88W8897 PCIe chip connected, is not this rather an issue in that
> > > > > > PCIe bridge (instead of mwifiex/88W8897) or in ACPI firmware which
> > > > > > controls this bridge?
> > > > > > 
> > > > > > Or are having other people same issues on mPCIe form factor wifi cards
> > > > > > with 88W8897 chips and then this quirk should not DMI dependent?
> > > > > > 
> > > > > > Note that I'm seeing issues with reset and other things also on chip
> > > > > > 88W8997 when is connected to system via SDIO. These chips have both PCIe
> > > > > > and SDIO buses, it just depends which pins are used.
> > > > > > 
> > > > > 
> > > > > Hi and thanks for the quick reply! Honestly I've no idea, this is just the
> > > > > first method we found that allows for a proper reset of the chip. What I
> > > > > know is that some Surface devices need that ACPI DSM call (the one that was
> > > > > done in the commit I dropped in this version of the patchset) to reset the
> > > > > chip instead of this method.
> > > > > 
> > > > > Afaik other devices with this chip don't need this resetting method, at
> > > > > least Marvell employees couldn't reproduce the issues on their testing
> > > > > devices.
> > > > > 
> > > > > So would you suggest we just try to match for the pci chip 88W8897 instead?
> > > > 
> > > > Hello! Such suggestion makes sense when we know that it is 88W8897
> > > > issue. But if you got information that issue cannot be reproduced on
> > > > other 88W8897 cards then matching 88W8897 is not correct.
> > > > 
> > > >   From all this information looks like that it is problem in (Microsoft?)
> > > > PCIe bridge to which is card connected. Otherwise I do not reason how it
> > > > can be 88W8897 affected. Either it is reproducible on 88W8897 cards also
> > > > in other devices or issue is not on 88W8897 card.
> > > 
> > > I doubt that it's an issue with the PCIe bridge (itself at least). The
> > > same type of bridge is used for both dGPU and NVME SSD on my device (see
> > > lspci output below) and those work fine. Also if I'm seeing that right
> > > it's from the Intel CPU, so my guess is that a lot more people would
> > > have issues with that then.
> > 
> >  From information below it seems to be related to surprise removal.
> > Therefore is surprise removal working without issue for dGPU or NVME
> > SSD? Not all PCIe bridges support surprise removal...
> 
> The dGPU on the Surface Book 2 is detachable (the whole base where that
> is placed can be removed). As far as I can tell surprise removal works
> perfectly fine for that one. The only thing that it needs is a driver for
> out-of-band hot-plug signalling if the device is in D3cold while removed
> as hotplug/removal notifications via PCI don't work in D3cold (this
> works via ACPI, there is as far as I can tell no such mechanism for
> WiFi, probably since it's not intended to be hot-unplugged).

Ok. Thank you for confirmation.

> > > I don't know about the hardware side, so it might be possible that it's
> > > an issue with integrating both bridge and wifi chip, in which case it's
> > > still probably best handled via DMI quirks unless we know more.
> > > 
> > > Also as Tsuchiya mentioned in his original submission, on Windows the
> > > device is reset via this D3cold method. I've only skimmed that
> > > errata.inf file mentioned, but I think this is what he's referring to:
> > > 
> > >    Controls whether ACPIDeviceEnableD3ColdOnSurpriseRemoval rule will be
> > >    evaluated or not on a given platform. Currently
> > >    ACPIDeviceEnableD3ColdOnSurpriseRemoval rule only needs to be
> > >    evaluated on Surface platforms which contain the Marvell WiFi
> > >    controller which depends on device going through D3Cold as part of
> > >    surprise-removal.
> > > 
> > > and
> > > 
> > >    Starting with Windows releases *after* Blue, ACPI will not put
> > >    surprise-removed devices into D3Cold automatically. Some known
> > >    scenarios (viz. WiFi reset/recovery) rely on the device cycling
> > >    through D3Cold on surprise-removal. This hack allows surprise-removed
> > >    devices to be put into D3Cold (if supported by the stack).
> > > 
> > > So, as far as I can tell, the chip doesn't like to be surprise-removed
> > > (which seems to happen during reset) and then needs to be power-cycled,
> > > which I think is likely due to some issue with firmware state.
> > 
> > Thanks for information. This really does not look like PCIe bridge
> > specific if bridge itself can handle surprise-removed devices. lspci can
> > tell us if bridge supports it or not (see below).
> > 
> > > So the quirk on Windows seems very Surface specific.
> > > 
> > > There also seem a bunch of revisions of these chips around, for example
> > > my SB2 is affected by a bug that we've tied to the specific hardware
> > > revision which causes some issues with host-sleep (IIRC chip switches
> > > rapidly between wake and sleep states without any external influence,
> > > which is not how it should behave and how it does behave on a later
> > > hardware revision).
> > 
> > Interesting... This looks like the issue can be in 88W8897 chip and
> > needs some special conditions to trigger? And Surface is triggering it
> > always?
> 
> Not always. It's been a while since I've been actively looking at this
> and I'm not sure we ever had a good way to reproduce this. Also, I've
> never really dealt with it as in-depth as Tsuchiya and Jonas have.
> 
> My (very) quick attempt ('echo 1 > /sys/bus/pci/.../reset) at
> reproducing this didn't work, so I think at very least a network
> connection needs to be active.

This is doing PCIe function level reset. Maybe you can get more luck
with PCIe Hot Reset. See following link how to trigger PCIe Hot Reset
from userspace: https://alexforencich.com/wiki/en/pcie/hot-reset-linux

> Unfortunately I can't test that with a
> network connection (and without compiling a custom kernel for which I
> don't have the time right now) because there's currently another bug
> deadlocking on device removal if there's an active connection during
> removal (which also seems to trigger on reset). That one ill be fixed
> by
> 
>   https://lore.kernel.org/linux-wireless/20210515024227.2159311-1-briannorris@chromium.org/
> 
> Jonas might know more.
> 
> > > > > Then we'd probably have to check if there are any laptops where multiple
> > > > > devices are connected to the pci bridge as Amey suggested in a review
> > > > > before.
> > > > 
> > > > Well, I do not know... But if this is issue with PCIe bridge then
> > > > similar issue could be observed also for other PCIe devices with this
> > > > PCIe bridge. But question is if there are other laptops with this PCIe
> > > > bridge. And also it can be a problem in ACPI firmware on those Surface
> > > > devices, which implements some PCIe bridge functionality. So it is
> > > > possible that issue is with PCIe bridge, not in HW, but in SW/firmware
> > > > part which can be Microsoft specific... So too many questions to which
> > > > we do not know answers.
> > > > 
> > > > Could you provide output of 'lspci -nn -vv' and 'lspci -tvnn' on
> > > > affected machines? If you have already sent it in some previous email,
> > > > just send a link. At least I'm not able to find it right now and output
> > > > may contain something useful...
> > > 
> > >  From my Surface Book 2 (with the same issue):
> > > 
> > >   - lspci -tvnn: https://paste.ubuntu.com/p/mm3YpcZJ8N/
> > >   - lspci -vv -nn: https://paste.ubuntu.com/p/dctTDP738N/
> > 
> > Could you re-run lspci under root account? There are missing important
> > parts like "Capabilities: <access denied>" where is information if
> > bridge supports surprise removal or not.
> 
> Ah sorry, sure thing. Here's the updated lspci -nn -vv log:
> 
>   https://paste.ubuntu.com/p/fzsmCvm86Y/
> 
> The log for lspci -tvnn is the same.

Ok. So bridge for wifi card (00:1c.0) indicates:

    SltCap: AttnBtn- PwrCtrl- MRL- AttnInd- PwrInd- HotPlug- Surprise-
            Slot #0, PowerLimit 10.000W; Interlock- NoCompl+

No support for surprise removal, nor for hotplug interrupt.

But bridge for nvidia card (00:1c.4) indicates:

    SltCap: AttnBtn- PwrCtrl- MRL- AttnInd- PwrInd- HotPlug+ Surprise+
            Slot #4, PowerLimit 25.000W; Interlock- NoCompl+

And interesting, it supports hotplug interrupt and also surprise
removal. Which matches what you wrote above about dGPU.

So another idea: maybe problem is really in 88W8897 and recovering is
working only via bridge which supports surprise removal? Just guessing.
Or kernel PCIe hotplug driver is doing something which is needed for
recovering 88W8897 and because this bridge does not support surprise
removal, it behaves differently?

> 
> > > Regards,
> > > Max
Maximilian Luz July 9, 2021, 7:27 p.m. UTC | #10
On 7/9/21 8:44 PM, Pali Rohár wrote:

[...]

>> My (very) quick attempt ('echo 1 > /sys/bus/pci/.../reset) at
>> reproducing this didn't work, so I think at very least a network
>> connection needs to be active.
> 
> This is doing PCIe function level reset. Maybe you can get more luck
> with PCIe Hot Reset. See following link how to trigger PCIe Hot Reset
> from userspace: https://alexforencich.com/wiki/en/pcie/hot-reset-linux

Thanks for that link! That does indeed do something which breaks the
adapter. Running the script produces

   [  178.388414] mwifiex_pcie 0000:01:00.0: PREP_CMD: card is removed
   [  178.389128] mwifiex_pcie 0000:01:00.0: PREP_CMD: card is removed
   [  178.461365] mwifiex_pcie 0000:01:00.0: performing cancel_work_sync()...
   [  178.461373] mwifiex_pcie 0000:01:00.0: cancel_work_sync() done
   [  178.984106] pci 0000:01:00.0: [11ab:2b38] type 00 class 0x020000
   [  178.984161] pci 0000:01:00.0: reg 0x10: [mem 0x00000000-0x000fffff 64bit pref]
   [  178.984193] pci 0000:01:00.0: reg 0x18: [mem 0x00000000-0x000fffff 64bit pref]
   [  178.984430] pci 0000:01:00.0: supports D1 D2
   [  178.984434] pci 0000:01:00.0: PME# supported from D0 D1 D3hot D3cold
   [  178.984871] pcieport 0000:00:1c.0: ASPM: current common clock configuration is inconsistent, reconfiguring
   [  179.297919] pci 0000:01:00.0: BAR 0: assigned [mem 0xd4400000-0xd44fffff 64bit pref]
   [  179.297961] pci 0000:01:00.0: BAR 2: assigned [mem 0xd4500000-0xd45fffff 64bit pref]
   [  179.298316] mwifiex_pcie 0000:01:00.0: enabling device (0000 -> 0002)
   [  179.298752] mwifiex_pcie: PCI memory map Virt0: 00000000c4593df1 PCI memory map Virt2: 0000000039d67daf
   [  179.300522] mwifiex_pcie 0000:01:00.0: WLAN read winner status failed!
   [  179.300552] mwifiex_pcie 0000:01:00.0: info: _mwifiex_fw_dpc: unregister device
   [  179.300622] mwifiex_pcie 0000:01:00.0: Read register failed
   [  179.300912] mwifiex_pcie 0000:01:00.0: performing cancel_work_sync()...
   [  179.300928] mwifiex_pcie 0000:01:00.0: cancel_work_sync() done

after which the card is unusable (there is no WiFi interface availabel
any more). Reloading the driver module doesn't help and produces

   [  376.906833] mwifiex_pcie: PCI memory map Virt0: 0000000025149d28 PCI memory map Virt2: 00000000c4593df1
   [  376.907278] mwifiex_pcie 0000:01:00.0: WLAN read winner status failed!
   [  376.907281] mwifiex_pcie 0000:01:00.0: info: _mwifiex_fw_dpc: unregister device
   [  376.907293] mwifiex_pcie 0000:01:00.0: Read register failed
   [  376.907404] mwifiex_pcie 0000:01:00.0: performing cancel_work_sync()...
   [  376.907406] mwifiex_pcie 0000:01:00.0: cancel_work_sync() done

again. Performing a function level reset produces

   [  402.489572] mwifiex_pcie 0000:01:00.0: mwifiex_pcie_reset_prepare: adapter structure is not valid
   [  403.514219] mwifiex_pcie 0000:01:00.0: mwifiex_pcie_reset_done: adapter structure is not valid

and doesn't help either.

Running the same command on a kernel with (among other) this patch
unfortunately also breaks the adapter in the same way. As far as I can
tell though, it doesn't run through the reset code added by this patch
(as indicated by the log message when performing FLR), which I assume
in a non-forced scenario, e.g. firmware issues (which IIRC is why this
patch exists), it would?

>> Unfortunately I can't test that with a
>> network connection (and without compiling a custom kernel for which I
>> don't have the time right now) because there's currently another bug
>> deadlocking on device removal if there's an active connection during
>> removal (which also seems to trigger on reset). That one ill be fixed
>> by
>>
>>    https://lore.kernel.org/linux-wireless/20210515024227.2159311-1-briannorris@chromium.org/
>>
>> Jonas might know more.

[...]
Pali Rohár July 9, 2021, 7:44 p.m. UTC | #11
On Friday 09 July 2021 21:27:51 Maximilian Luz wrote:
> On 7/9/21 8:44 PM, Pali Rohár wrote:
> 
> [...]
> 
> > > My (very) quick attempt ('echo 1 > /sys/bus/pci/.../reset) at
> > > reproducing this didn't work, so I think at very least a network
> > > connection needs to be active.
> > 
> > This is doing PCIe function level reset. Maybe you can get more luck
> > with PCIe Hot Reset. See following link how to trigger PCIe Hot Reset
> > from userspace: https://alexforencich.com/wiki/en/pcie/hot-reset-linux
> 
> Thanks for that link! That does indeed do something which breaks the
> adapter. Running the script produces
> 
>   [  178.388414] mwifiex_pcie 0000:01:00.0: PREP_CMD: card is removed
>   [  178.389128] mwifiex_pcie 0000:01:00.0: PREP_CMD: card is removed
>   [  178.461365] mwifiex_pcie 0000:01:00.0: performing cancel_work_sync()...
>   [  178.461373] mwifiex_pcie 0000:01:00.0: cancel_work_sync() done
>   [  178.984106] pci 0000:01:00.0: [11ab:2b38] type 00 class 0x020000
>   [  178.984161] pci 0000:01:00.0: reg 0x10: [mem 0x00000000-0x000fffff 64bit pref]
>   [  178.984193] pci 0000:01:00.0: reg 0x18: [mem 0x00000000-0x000fffff 64bit pref]
>   [  178.984430] pci 0000:01:00.0: supports D1 D2
>   [  178.984434] pci 0000:01:00.0: PME# supported from D0 D1 D3hot D3cold
>   [  178.984871] pcieport 0000:00:1c.0: ASPM: current common clock configuration is inconsistent, reconfiguring
>   [  179.297919] pci 0000:01:00.0: BAR 0: assigned [mem 0xd4400000-0xd44fffff 64bit pref]
>   [  179.297961] pci 0000:01:00.0: BAR 2: assigned [mem 0xd4500000-0xd45fffff 64bit pref]
>   [  179.298316] mwifiex_pcie 0000:01:00.0: enabling device (0000 -> 0002)
>   [  179.298752] mwifiex_pcie: PCI memory map Virt0: 00000000c4593df1 PCI memory map Virt2: 0000000039d67daf
>   [  179.300522] mwifiex_pcie 0000:01:00.0: WLAN read winner status failed!
>   [  179.300552] mwifiex_pcie 0000:01:00.0: info: _mwifiex_fw_dpc: unregister device
>   [  179.300622] mwifiex_pcie 0000:01:00.0: Read register failed
>   [  179.300912] mwifiex_pcie 0000:01:00.0: performing cancel_work_sync()...
>   [  179.300928] mwifiex_pcie 0000:01:00.0: cancel_work_sync() done
> 
> after which the card is unusable (there is no WiFi interface availabel
> any more). Reloading the driver module doesn't help and produces
> 
>   [  376.906833] mwifiex_pcie: PCI memory map Virt0: 0000000025149d28 PCI memory map Virt2: 00000000c4593df1
>   [  376.907278] mwifiex_pcie 0000:01:00.0: WLAN read winner status failed!
>   [  376.907281] mwifiex_pcie 0000:01:00.0: info: _mwifiex_fw_dpc: unregister device
>   [  376.907293] mwifiex_pcie 0000:01:00.0: Read register failed
>   [  376.907404] mwifiex_pcie 0000:01:00.0: performing cancel_work_sync()...
>   [  376.907406] mwifiex_pcie 0000:01:00.0: cancel_work_sync() done
> 
> again. Performing a function level reset produces
> 
>   [  402.489572] mwifiex_pcie 0000:01:00.0: mwifiex_pcie_reset_prepare: adapter structure is not valid
>   [  403.514219] mwifiex_pcie 0000:01:00.0: mwifiex_pcie_reset_done: adapter structure is not valid
> 
> and doesn't help either.

More Qualcomm/Atheros wifi cards are broken in a way that they stop
responding after PCIe Hot Reset and completely disappear from the PCIe
bus. It is possible that similar issue have also these Marvell cards?

As now we know that bride does not support hotplug it cannot inform
system when card disconnect from the bus. The one possible way how to
detect if PCIe card is available at specific address is trying to read
its device and vendor id. Kernel caches device/vendor id, so from
userspace is needed to call lspci with -b argument (to ignore kernel's
reported cached value). Could you provide output of following command
after you do PCIe Hot Reset?

    lspci -s 01:00.0 -b -vv

(and compare with output which you have already provided if there are
any differences)

If PCIe Hot Reset is breaking the card then the only option how to
"reset" card into working state again is PCIe Warm Reset. Unfortunately
there is no common mechanism how to do it from system. PCIe Warm Reset
is done by asserting PERST# signal on card itself, in mPCIe form factor
it is pin 22. In most cases pin 22 is connected to some GPIO so via GPIO
subsystem it could be controlled.

> Running the same command on a kernel with (among other) this patch
> unfortunately also breaks the adapter in the same way. As far as I can
> tell though, it doesn't run through the reset code added by this patch
> (as indicated by the log message when performing FLR), which I assume
> in a non-forced scenario, e.g. firmware issues (which IIRC is why this
> patch exists), it would?

Err... I have caught this part. Is proposed patch able to recover also
from state which happens after PCIe Hot Reset?

> > > Unfortunately I can't test that with a
> > > network connection (and without compiling a custom kernel for which I
> > > don't have the time right now) because there's currently another bug
> > > deadlocking on device removal if there's an active connection during
> > > removal (which also seems to trigger on reset). That one ill be fixed
> > > by
> > > 
> > >    https://lore.kernel.org/linux-wireless/20210515024227.2159311-1-briannorris@chromium.org/
> > > 
> > > Jonas might know more.
> 
> [...]
Maximilian Luz July 9, 2021, 8:54 p.m. UTC | #12
On 7/9/21 9:44 PM, Pali Rohár wrote:
> On Friday 09 July 2021 21:27:51 Maximilian Luz wrote:
>> On 7/9/21 8:44 PM, Pali Rohár wrote:
>>
>> [...]
>>
>>>> My (very) quick attempt ('echo 1 > /sys/bus/pci/.../reset) at
>>>> reproducing this didn't work, so I think at very least a network
>>>> connection needs to be active.
>>>
>>> This is doing PCIe function level reset. Maybe you can get more luck
>>> with PCIe Hot Reset. See following link how to trigger PCIe Hot Reset
>>> from userspace: https://alexforencich.com/wiki/en/pcie/hot-reset-linux
>>
>> Thanks for that link! That does indeed do something which breaks the
>> adapter. Running the script produces
>>
>>    [  178.388414] mwifiex_pcie 0000:01:00.0: PREP_CMD: card is removed
>>    [  178.389128] mwifiex_pcie 0000:01:00.0: PREP_CMD: card is removed
>>    [  178.461365] mwifiex_pcie 0000:01:00.0: performing cancel_work_sync()...
>>    [  178.461373] mwifiex_pcie 0000:01:00.0: cancel_work_sync() done
>>    [  178.984106] pci 0000:01:00.0: [11ab:2b38] type 00 class 0x020000
>>    [  178.984161] pci 0000:01:00.0: reg 0x10: [mem 0x00000000-0x000fffff 64bit pref]
>>    [  178.984193] pci 0000:01:00.0: reg 0x18: [mem 0x00000000-0x000fffff 64bit pref]
>>    [  178.984430] pci 0000:01:00.0: supports D1 D2
>>    [  178.984434] pci 0000:01:00.0: PME# supported from D0 D1 D3hot D3cold
>>    [  178.984871] pcieport 0000:00:1c.0: ASPM: current common clock configuration is inconsistent, reconfiguring
>>    [  179.297919] pci 0000:01:00.0: BAR 0: assigned [mem 0xd4400000-0xd44fffff 64bit pref]
>>    [  179.297961] pci 0000:01:00.0: BAR 2: assigned [mem 0xd4500000-0xd45fffff 64bit pref]
>>    [  179.298316] mwifiex_pcie 0000:01:00.0: enabling device (0000 -> 0002)
>>    [  179.298752] mwifiex_pcie: PCI memory map Virt0: 00000000c4593df1 PCI memory map Virt2: 0000000039d67daf
>>    [  179.300522] mwifiex_pcie 0000:01:00.0: WLAN read winner status failed!
>>    [  179.300552] mwifiex_pcie 0000:01:00.0: info: _mwifiex_fw_dpc: unregister device
>>    [  179.300622] mwifiex_pcie 0000:01:00.0: Read register failed
>>    [  179.300912] mwifiex_pcie 0000:01:00.0: performing cancel_work_sync()...
>>    [  179.300928] mwifiex_pcie 0000:01:00.0: cancel_work_sync() done
>>
>> after which the card is unusable (there is no WiFi interface availabel
>> any more). Reloading the driver module doesn't help and produces
>>
>>    [  376.906833] mwifiex_pcie: PCI memory map Virt0: 0000000025149d28 PCI memory map Virt2: 00000000c4593df1
>>    [  376.907278] mwifiex_pcie 0000:01:00.0: WLAN read winner status failed!
>>    [  376.907281] mwifiex_pcie 0000:01:00.0: info: _mwifiex_fw_dpc: unregister device
>>    [  376.907293] mwifiex_pcie 0000:01:00.0: Read register failed
>>    [  376.907404] mwifiex_pcie 0000:01:00.0: performing cancel_work_sync()...
>>    [  376.907406] mwifiex_pcie 0000:01:00.0: cancel_work_sync() done
>>
>> again. Performing a function level reset produces
>>
>>    [  402.489572] mwifiex_pcie 0000:01:00.0: mwifiex_pcie_reset_prepare: adapter structure is not valid
>>    [  403.514219] mwifiex_pcie 0000:01:00.0: mwifiex_pcie_reset_done: adapter structure is not valid
>>
>> and doesn't help either.
> 
> More Qualcomm/Atheros wifi cards are broken in a way that they stop
> responding after PCIe Hot Reset and completely disappear from the PCIe
> bus. It is possible that similar issue have also these Marvell cards?
> 
> As now we know that bride does not support hotplug it cannot inform
> system when card disconnect from the bus. The one possible way how to
> detect if PCIe card is available at specific address is trying to read
> its device and vendor id. Kernel caches device/vendor id, so from
> userspace is needed to call lspci with -b argument (to ignore kernel's
> reported cached value). Could you provide output of following command
> after you do PCIe Hot Reset?
> 
>      lspci -s 01:00.0 -b -vv
> 
> (and compare with output which you have already provided if there are
> any differences)

There do seem to be some differences, specifically regarding memory.
See https://paste.ubuntu.com/p/Rz2CDjwkCv/ for the full output.

> If PCIe Hot Reset is breaking the card then the only option how to
> "reset" card into working state again is PCIe Warm Reset. Unfortunately
> there is no common mechanism how to do it from system. PCIe Warm Reset
> is done by asserting PERST# signal on card itself, in mPCIe form factor
> it is pin 22. In most cases pin 22 is connected to some GPIO so via GPIO
> subsystem it could be controlled.
> 
>> Running the same command on a kernel with (among other) this patch
>> unfortunately also breaks the adapter in the same way. As far as I can
>> tell though, it doesn't run through the reset code added by this patch
>> (as indicated by the log message when performing FLR), which I assume
>> in a non-forced scenario, e.g. firmware issues (which IIRC is why this
>> patch exists), it would?
> 
> Err... I have caught this part. Is proposed patch able to recover also
> from state which happens after PCIe Hot Reset?

I'm not sure at this point if the power-cycle through D3cold would fix
this (I think it might, but have no evidence for that). This patch alone
isn't able to recover the device, as, when triggering the hot-reset via
that script, the code never seems to run mwifiex_pcie_reset_d3cold_quirk().

If I remember correctly, the main issue was that the firmware state
doesn't get reset completely. This can be somewhat observed when doing
'echo 1 > /sys/bus/pci/devices/.../reset' via the difference in log
messages:

For an unpatched kernel:

   [   64.159509] mwifiex_pcie 0000:01:00.0: info: shutdown mwifiex...
   [   64.159546] mwifiex_pcie 0000:01:00.0: PREP_CMD: card is removed
   [   64.159922] mwifiex_pcie 0000:01:00.0: PREP_CMD: card is removed
   [   65.240272] mwifiex_pcie 0000:01:00.0: WLAN FW already running! Skip FW dnld
   [   65.240285] mwifiex_pcie 0000:01:00.0: WLAN FW is active
   [   65.327359] mwifiex_pcie 0000:01:00.0: info: MWIFIEX VERSION: mwifiex 1.0 (15.68.19.p21)
   [   65.327370] mwifiex_pcie 0000:01:00.0: driver_version = mwifiex 1.0 (15.68.19.p21)

For a patched kernel:

   [   41.966094] mwifiex_pcie 0000:01:00.0: info: shutdown mwifiex...
   [   41.966451] mwifiex_pcie 0000:01:00.0: PREP_CMD: card is removed
   [   41.967227] mwifiex_pcie 0000:01:00.0: PREP_CMD: card is removed
   [   42.063543] mwifiex_pcie 0000:01:00.0: Using reset_d3cold quirk to perform FW reset
   [   42.063558] mwifiex_pcie 0000:01:00.0: putting into D3cold...
   [   42.081010] usb 1-6: USB disconnect, device number 9
   [   42.339922] pcieport 0000:00:1c.0: putting into D3cold...
   [   42.425766] pcieport 0000:00:1c.0: putting into D0...
   [   42.695987] mwifiex_pcie 0000:01:00.0: putting into D0...
   [   42.956673] mwifiex_pcie 0000:01:00.0: enabling device (0000 -> 0002)
   [   45.012736] mwifiex_pcie 0000:01:00.0: info: FW download over, size 723540 bytes
   [   45.740882] usb 1-6: new high-speed USB device number 10 using xhci_hcd
   [   45.881294] usb 1-6: New USB device found, idVendor=1286, idProduct=204c, bcdDevice=32.01
   [   45.881308] usb 1-6: New USB device strings: Mfr=1, Product=2, SerialNumber=3
   [   45.881313] usb 1-6: Product: Bluetooth and Wireless LAN Composite Device
   [   45.881318] usb 1-6: Manufacturer: Marvell
   [   45.881322] usb 1-6: SerialNumber: 0000000000000000
   [   45.884882] Bluetooth: hci0: unexpected event for opcode 0x0000
   [   45.885128] Bluetooth: hci0: unexpected event for opcode 0x0000
   [   45.966218] mwifiex_pcie 0000:01:00.0: WLAN FW is active
   [   46.157474] mwifiex_pcie 0000:01:00.0: info: MWIFIEX VERSION: mwifiex 1.0 (15.68.19.p21)
   [   46.157485] mwifiex_pcie 0000:01:00.0: driver_version = mwifiex 1.0 (15.68.19.p21)

Note the absence of "WLAN FW already running! Skip FW dnld" on the
second log. Due to the power cycle we're essentially forcing the device
to re-download and re-initialize its firmware. On an unpatched kernel,
it looks like the firmware itself is kept and state may not be cleared
properly. So in other words it looks like the firmware, while being
prompted to do a reset, doesn't do that properly (at least when it has
crashed before).

IIRC this then allowed us to recover from firmware issues that the
"normal" firmware reset that mwifiex supposedly performs on a FLR
(or reloading driver modules) didn't help with.

I'm not so sure any more if resetting actively caused issues or if it
just showed different symptoms of some firmware issue that prompted us
to do the reset in the first place (again, been quite a while since I
last dealt with this stuff, sorry). All I know is that this patched
reset gets the card going again.

As a side note: There are also more patches by Jonas (and Tsuchiya?)
building on top of the quirk implementation introduced here which
significantly reduce the need for doing resets in the first place
(nevertheless having a reset that actually does properly reset the
device is a good thing IMHO). Those patches can for example be found
here:

   https://github.com/linux-surface/kernel/compare/eaaf96ba58a5fe5999b89fe3afaded74caa96480...989c8725a6d4e62db6370dd0fefe45498274d3ce

>>>> Unfortunately I can't test that with a
>>>> network connection (and without compiling a custom kernel for which I
>>>> don't have the time right now) because there's currently another bug
>>>> deadlocking on device removal if there's an active connection during
>>>> removal (which also seems to trigger on reset). That one ill be fixed
>>>> by
>>>>
>>>>     https://lore.kernel.org/linux-wireless/20210515024227.2159311-1-briannorris@chromium.org/
>>>>
>>>> Jonas might know more.
>>
>> [...]
Pali Rohár July 9, 2021, 9:25 p.m. UTC | #13
On Friday 09 July 2021 22:54:09 Maximilian Luz wrote:
> On 7/9/21 9:44 PM, Pali Rohár wrote:
> > On Friday 09 July 2021 21:27:51 Maximilian Luz wrote:
> > > On 7/9/21 8:44 PM, Pali Rohár wrote:
> > > 
> > > [...]
> > > 
> > > > > My (very) quick attempt ('echo 1 > /sys/bus/pci/.../reset) at
> > > > > reproducing this didn't work, so I think at very least a network
> > > > > connection needs to be active.
> > > > 
> > > > This is doing PCIe function level reset. Maybe you can get more luck
> > > > with PCIe Hot Reset. See following link how to trigger PCIe Hot Reset
> > > > from userspace: https://alexforencich.com/wiki/en/pcie/hot-reset-linux
> > > 
> > > Thanks for that link! That does indeed do something which breaks the
> > > adapter. Running the script produces
> > > 
> > >    [  178.388414] mwifiex_pcie 0000:01:00.0: PREP_CMD: card is removed
> > >    [  178.389128] mwifiex_pcie 0000:01:00.0: PREP_CMD: card is removed
> > >    [  178.461365] mwifiex_pcie 0000:01:00.0: performing cancel_work_sync()...
> > >    [  178.461373] mwifiex_pcie 0000:01:00.0: cancel_work_sync() done
> > >    [  178.984106] pci 0000:01:00.0: [11ab:2b38] type 00 class 0x020000
> > >    [  178.984161] pci 0000:01:00.0: reg 0x10: [mem 0x00000000-0x000fffff 64bit pref]
> > >    [  178.984193] pci 0000:01:00.0: reg 0x18: [mem 0x00000000-0x000fffff 64bit pref]
> > >    [  178.984430] pci 0000:01:00.0: supports D1 D2
> > >    [  178.984434] pci 0000:01:00.0: PME# supported from D0 D1 D3hot D3cold
> > >    [  178.984871] pcieport 0000:00:1c.0: ASPM: current common clock configuration is inconsistent, reconfiguring
> > >    [  179.297919] pci 0000:01:00.0: BAR 0: assigned [mem 0xd4400000-0xd44fffff 64bit pref]
> > >    [  179.297961] pci 0000:01:00.0: BAR 2: assigned [mem 0xd4500000-0xd45fffff 64bit pref]
> > >    [  179.298316] mwifiex_pcie 0000:01:00.0: enabling device (0000 -> 0002)
> > >    [  179.298752] mwifiex_pcie: PCI memory map Virt0: 00000000c4593df1 PCI memory map Virt2: 0000000039d67daf
> > >    [  179.300522] mwifiex_pcie 0000:01:00.0: WLAN read winner status failed!
> > >    [  179.300552] mwifiex_pcie 0000:01:00.0: info: _mwifiex_fw_dpc: unregister device
> > >    [  179.300622] mwifiex_pcie 0000:01:00.0: Read register failed
> > >    [  179.300912] mwifiex_pcie 0000:01:00.0: performing cancel_work_sync()...
> > >    [  179.300928] mwifiex_pcie 0000:01:00.0: cancel_work_sync() done
> > > 
> > > after which the card is unusable (there is no WiFi interface availabel
> > > any more). Reloading the driver module doesn't help and produces
> > > 
> > >    [  376.906833] mwifiex_pcie: PCI memory map Virt0: 0000000025149d28 PCI memory map Virt2: 00000000c4593df1
> > >    [  376.907278] mwifiex_pcie 0000:01:00.0: WLAN read winner status failed!
> > >    [  376.907281] mwifiex_pcie 0000:01:00.0: info: _mwifiex_fw_dpc: unregister device
> > >    [  376.907293] mwifiex_pcie 0000:01:00.0: Read register failed
> > >    [  376.907404] mwifiex_pcie 0000:01:00.0: performing cancel_work_sync()...
> > >    [  376.907406] mwifiex_pcie 0000:01:00.0: cancel_work_sync() done
> > > 
> > > again. Performing a function level reset produces
> > > 
> > >    [  402.489572] mwifiex_pcie 0000:01:00.0: mwifiex_pcie_reset_prepare: adapter structure is not valid
> > >    [  403.514219] mwifiex_pcie 0000:01:00.0: mwifiex_pcie_reset_done: adapter structure is not valid
> > > 
> > > and doesn't help either.
> > 
> > More Qualcomm/Atheros wifi cards are broken in a way that they stop
> > responding after PCIe Hot Reset and completely disappear from the PCIe
> > bus. It is possible that similar issue have also these Marvell cards?
> > 
> > As now we know that bride does not support hotplug it cannot inform
> > system when card disconnect from the bus. The one possible way how to
> > detect if PCIe card is available at specific address is trying to read
> > its device and vendor id. Kernel caches device/vendor id, so from
> > userspace is needed to call lspci with -b argument (to ignore kernel's
> > reported cached value). Could you provide output of following command
> > after you do PCIe Hot Reset?
> > 
> >      lspci -s 01:00.0 -b -vv
> > 
> > (and compare with output which you have already provided if there are
> > any differences)
> 
> There do seem to be some differences, specifically regarding memory.
> See https://paste.ubuntu.com/p/Rz2CDjwkCv/ for the full output.

Output is OK, card is alive and available on bus. So at least we are not
dealing with the similar issue like are observed in Qualcomm/Atheros.
Missing memory just means that driver has not enabled memory access
(because it did not successfully initialized).

Anyway, card is reporting error via AER:

	Capabilities: [100 v2] Advanced Error Reporting
		UESta:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq+ ACSViol-
		UEMsk:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UESvrt:	DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
		CESta:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
		CEMsk:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
		AERCap:	First Error Pointer: 14, ECRCGenCap+ ECRCGenEn- ECRCChkCap+ ECRCChkEn-
			MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
		HeaderLog: 40000001 0000000f d4500c58 00000000

There is (UESta) marked that Unsupported Request happened and in
HeaderLog is information about error. Unfortunately I'm not skilled to
decode it, so maybe other PCI developers could say if there is something
important in HeaderLog or not.

> > If PCIe Hot Reset is breaking the card then the only option how to
> > "reset" card into working state again is PCIe Warm Reset. Unfortunately
> > there is no common mechanism how to do it from system. PCIe Warm Reset
> > is done by asserting PERST# signal on card itself, in mPCIe form factor
> > it is pin 22. In most cases pin 22 is connected to some GPIO so via GPIO
> > subsystem it could be controlled.
> > 
> > > Running the same command on a kernel with (among other) this patch
> > > unfortunately also breaks the adapter in the same way. As far as I can
> > > tell though, it doesn't run through the reset code added by this patch
> > > (as indicated by the log message when performing FLR), which I assume
> > > in a non-forced scenario, e.g. firmware issues (which IIRC is why this
> > > patch exists), it would?
> > 
> > Err... I have caught this part. Is proposed patch able to recover also
> > from state which happens after PCIe Hot Reset?
> 
> I'm not sure at this point if the power-cycle through D3cold would fix
> this (I think it might, but have no evidence for that). This patch alone
> isn't able to recover the device, as, when triggering the hot-reset via
> that script, the code never seems to run mwifiex_pcie_reset_d3cold_quirk().
> 
> If I remember correctly, the main issue was that the firmware state
> doesn't get reset completely. This can be somewhat observed when doing
> 'echo 1 > /sys/bus/pci/devices/.../reset' via the difference in log
> messages:
> 
> For an unpatched kernel:
> 
>   [   64.159509] mwifiex_pcie 0000:01:00.0: info: shutdown mwifiex...
>   [   64.159546] mwifiex_pcie 0000:01:00.0: PREP_CMD: card is removed
>   [   64.159922] mwifiex_pcie 0000:01:00.0: PREP_CMD: card is removed
>   [   65.240272] mwifiex_pcie 0000:01:00.0: WLAN FW already running! Skip FW dnld
>   [   65.240285] mwifiex_pcie 0000:01:00.0: WLAN FW is active
>   [   65.327359] mwifiex_pcie 0000:01:00.0: info: MWIFIEX VERSION: mwifiex 1.0 (15.68.19.p21)
>   [   65.327370] mwifiex_pcie 0000:01:00.0: driver_version = mwifiex 1.0 (15.68.19.p21)
> 
> For a patched kernel:
> 
>   [   41.966094] mwifiex_pcie 0000:01:00.0: info: shutdown mwifiex...
>   [   41.966451] mwifiex_pcie 0000:01:00.0: PREP_CMD: card is removed
>   [   41.967227] mwifiex_pcie 0000:01:00.0: PREP_CMD: card is removed
>   [   42.063543] mwifiex_pcie 0000:01:00.0: Using reset_d3cold quirk to perform FW reset
>   [   42.063558] mwifiex_pcie 0000:01:00.0: putting into D3cold...
>   [   42.081010] usb 1-6: USB disconnect, device number 9
>   [   42.339922] pcieport 0000:00:1c.0: putting into D3cold...
>   [   42.425766] pcieport 0000:00:1c.0: putting into D0...
>   [   42.695987] mwifiex_pcie 0000:01:00.0: putting into D0...
>   [   42.956673] mwifiex_pcie 0000:01:00.0: enabling device (0000 -> 0002)
>   [   45.012736] mwifiex_pcie 0000:01:00.0: info: FW download over, size 723540 bytes
>   [   45.740882] usb 1-6: new high-speed USB device number 10 using xhci_hcd
>   [   45.881294] usb 1-6: New USB device found, idVendor=1286, idProduct=204c, bcdDevice=32.01
>   [   45.881308] usb 1-6: New USB device strings: Mfr=1, Product=2, SerialNumber=3
>   [   45.881313] usb 1-6: Product: Bluetooth and Wireless LAN Composite Device
>   [   45.881318] usb 1-6: Manufacturer: Marvell
>   [   45.881322] usb 1-6: SerialNumber: 0000000000000000
>   [   45.884882] Bluetooth: hci0: unexpected event for opcode 0x0000
>   [   45.885128] Bluetooth: hci0: unexpected event for opcode 0x0000
>   [   45.966218] mwifiex_pcie 0000:01:00.0: WLAN FW is active
>   [   46.157474] mwifiex_pcie 0000:01:00.0: info: MWIFIEX VERSION: mwifiex 1.0 (15.68.19.p21)
>   [   46.157485] mwifiex_pcie 0000:01:00.0: driver_version = mwifiex 1.0 (15.68.19.p21)

Interesting... putting PCIe card into D3cold cause shutdown also of USB
bluetooth part.

> Note the absence of "WLAN FW already running! Skip FW dnld" on the
> second log. Due to the power cycle we're essentially forcing the device
> to re-download and re-initialize its firmware.

Yea, this proves that putting into D3cold and returning back to D0 cause
full reset of the card, both PCIe and USB parts. ARM CPU on this card
therefore must be reset and so new full firmware download and start it
again is needed.

> On an unpatched kernel,
> it looks like the firmware itself is kept and state may not be cleared
> properly. So in other words it looks like the firmware, while being
> prompted to do a reset, doesn't do that properly (at least when it has
> crashed before).
> 
> IIRC this then allowed us to recover from firmware issues that the
> "normal" firmware reset that mwifiex supposedly performs on a FLR
> (or reloading driver modules) didn't help with.

PCIe Function Level Reset should reset only one PCIe part of device. And
seems that this type of reset does not work properly in some situations.

Note that PCIe Function Level Reset is independent of running firmware.
It is implement in hardware and (should) work also at early stage when
firmware is not loaded yet.

I'm starting to think more and more if quirk in this patch really needs
to be behind DMI check and if rather it should not be called on other
platforms too?

> I'm not so sure any more if resetting actively caused issues or if it
> just showed different symptoms of some firmware issue that prompted us
> to do the reset in the first place (again, been quite a while since I
> last dealt with this stuff, sorry). All I know is that this patched
> reset gets the card going again.
> 
> As a side note: There are also more patches by Jonas (and Tsuchiya?)
> building on top of the quirk implementation introduced here which
> significantly reduce the need for doing resets in the first place
> (nevertheless having a reset that actually does properly reset the
> device is a good thing IMHO). Those patches can for example be found
> here:
> 
>   https://github.com/linux-surface/kernel/compare/eaaf96ba58a5fe5999b89fe3afaded74caa96480...989c8725a6d4e62db6370dd0fefe45498274d3ce
> 
> > > > > Unfortunately I can't test that with a
> > > > > network connection (and without compiling a custom kernel for which I
> > > > > don't have the time right now) because there's currently another bug
> > > > > deadlocking on device removal if there's an active connection during
> > > > > removal (which also seems to trigger on reset). That one ill be fixed
> > > > > by
> > > > > 
> > > > >     https://lore.kernel.org/linux-wireless/20210515024227.2159311-1-briannorris@chromium.org/
> > > > > 
> > > > > Jonas might know more.
> > > 
> > > [...]
Maximilian Luz July 9, 2021, 10:25 p.m. UTC | #14
On 7/9/21 11:25 PM, Pali Rohár wrote:

[...]

> PCIe Function Level Reset should reset only one PCIe part of device. And
> seems that this type of reset does not work properly in some situations.
> 
> Note that PCIe Function Level Reset is independent of running firmware.
> It is implement in hardware and (should) work also at early stage when
> firmware is not loaded yet.
> 
> I'm starting to think more and more if quirk in this patch really needs
> to be behind DMI check and if rather it should not be called on other
> platforms too?

Maybe? I'm not sure how well this behaves on other devices and if there
even are any devices outside of the MS Surface line that really require
or benefit from something like this. To me it seems safer to put it
behind quirks, at least until we know more.

Also not sure if this is just my bias, but it feels like the Surface
line always had more problems with that driver (and firmware) than
others.  I'm honestly a bit surprised that MS stuck with them for this
long (they decided to go with Intel for 7th gen devices). AFAICT they
initially chose Marvell due to connected standby support, so maybe that
causes issue for us and others simply aren't using that feature? Just
guessing though.
Pali Rohár July 9, 2021, 10:54 p.m. UTC | #15
On Saturday 10 July 2021 00:25:36 Maximilian Luz wrote:
> On 7/9/21 11:25 PM, Pali Rohár wrote:
> 
> [...]
> 
> > PCIe Function Level Reset should reset only one PCIe part of device. And
> > seems that this type of reset does not work properly in some situations.
> > 
> > Note that PCIe Function Level Reset is independent of running firmware.
> > It is implement in hardware and (should) work also at early stage when
> > firmware is not loaded yet.
> > 
> > I'm starting to think more and more if quirk in this patch really needs
> > to be behind DMI check and if rather it should not be called on other
> > platforms too?
> 
> Maybe? I'm not sure how well this behaves on other devices and if there
> even are any devices outside of the MS Surface line that really require
> or benefit from something like this. To me it seems safer to put it
> behind quirks, at least until we know more.

This is a good argument for DMI based quirk.

But at the end, PCI people should look at this patch and say what do
they thing about it, if it is better to enable it only for specific set
of machines (Surface) or it is better to enable it for every 88W8xxx
wifi or for none.

> Also not sure if this is just my bias, but it feels like the Surface
> line always had more problems with that driver (and firmware) than
> others.

Ehm, really? I see reports also from non-Surface users about bad quality
of these 88W[89]xxx cards and repeating firmware issues. I have bad
personal experience with 88W8997 SDIO firmware and lot of times I get
advice about ex-Marvell/NXP wifi cards "do not touch and run away
quickly".

I think that more people if they get mPCIe/M.2 wifi card in laptop which
does not work, they just replace it with some working one. And not
spending infinite time in trying to fix it... So this may explain why
there are more Surface users with these issues...

> I'm honestly a bit surprised that MS stuck with them for this
> long (they decided to go with Intel for 7th gen devices). AFAICT they
> initially chose Marvell due to connected standby support, so maybe that
> causes issue for us and others simply aren't using that feature? Just
> guessing though.

In my opinion that "Connected Standby" is just MS marketing term.

88W[89]xxx chips using full-mac firmware and drivers [*]. Full-mac lot
of times causing more issues than soft-mac solution. Moreover this
Marvell firmware implements also other "application" layers in firmware
which OS drivers can use, e.g. there is fully working "wpa-supplicant"
replacement and also AP part. Maybe windows drivers are using it and it
cause less problems? Duno. mwifiex uses only "low level" commands and
WPA state machine is implemented in userspace wpa-supplicant daemon.


[*] - Small note: There are also soft-mac firmwares and drivers but
apparently Marvell has never finished linux driver and firmware was not
released to public...

And there is also Laird Connectivity which offers their own proprietary
linux kernel drivers with their own firmware for these 88W[89]xxx chips.
Last time I checked it they released some parts of driver on github.
Maybe somebody could contact Laird or check if their driver can be
replaced by mwifiex? Or just replacing ex-Marvell/NXP firmware by their?
But I'm not sure if they have something for 88W8897.
Maximilian Luz July 10, 2021, midnight UTC | #16
On 7/10/21 12:54 AM, Pali Rohár wrote:

[...]

>> Also not sure if this is just my bias, but it feels like the Surface
>> line always had more problems with that driver (and firmware) than
>> others.
> 
> Ehm, really? I see reports also from non-Surface users about bad quality
> of these 88W[89]xxx cards and repeating firmware issues. I have bad
> personal experience with 88W8997 SDIO firmware and lot of times I get
> advice about ex-Marvell/NXP wifi cards "do not touch and run away
> quickly".

Yeah, then I'm probably biased since I'm mostly dealing with Surface
stuff.

> I think that more people if they get mPCIe/M.2 wifi card in laptop which
> does not work, they just replace it with some working one. And not
> spending infinite time in trying to fix it... So this may explain why
> there are more Surface users with these issues...

That might be an explanation. If it wouldn't need a heat-gun to open it
up, I'd probably have done that at some point in the past (there were
times when WiFi at my Uni was pretty much unusable with this device...
and I'm still not sure what fixed that or even if it's fixed completely).

>> I'm honestly a bit surprised that MS stuck with them for this
>> long (they decided to go with Intel for 7th gen devices). AFAICT they
>> initially chose Marvell due to connected standby support, so maybe that
>> causes issue for us and others simply aren't using that feature? Just
>> guessing though.
> 
> In my opinion that "Connected Standby" is just MS marketing term.

I can only really repeat what I've been told: Apparently when they
started designing those devices, the only option with "Connected
standby" (or probably rather that feature set that MS wanted) was,
unfortunately for us, Marvell.

> 88W[89]xxx chips using full-mac firmware and drivers [*]. Full-mac lot
> of times causing more issues than soft-mac solution. Moreover this
> Marvell firmware implements also other "application" layers in firmware
> which OS drivers can use, e.g. there is fully working "wpa-supplicant"
> replacement and also AP part. Maybe windows drivers are using it and it
> cause less problems? Duno. mwifiex uses only "low level" commands and
> WPA state machine is implemented in userspace wpa-supplicant daemon.
> 
> [*] - Small note: There are also soft-mac firmwares and drivers but
> apparently Marvell has never finished linux driver and firmware was not
> released to public...
> 
> And there is also Laird Connectivity which offers their own proprietary
> linux kernel drivers with their own firmware for these 88W[89]xxx chips.
> Last time I checked it they released some parts of driver on github.
> Maybe somebody could contact Laird or check if their driver can be
> replaced by mwifiex? Or just replacing ex-Marvell/NXP firmware by their?
> But I'm not sure if they have something for 88W8897.

Interesting, I was not aware of this. IIRC we've been experimenting with
the mwlwifi driver (which that lrdmwl driver seems to be based on?), but
couldn't get that to work with the firmware we have. IIRC it also didn't
work with the Windows firmware (which seems to be significantly
different from the one we have for Linux and seems to use or be modeled
after some special Windows WiFi driver interface).
Pali Rohár July 10, 2021, 12:07 a.m. UTC | #17
On Saturday 10 July 2021 02:00:08 Maximilian Luz wrote:
> On 7/10/21 12:54 AM, Pali Rohár wrote:
> 
> [...]
> 
> > > Also not sure if this is just my bias, but it feels like the Surface
> > > line always had more problems with that driver (and firmware) than
> > > others.
> > 
> > Ehm, really? I see reports also from non-Surface users about bad quality
> > of these 88W[89]xxx cards and repeating firmware issues. I have bad
> > personal experience with 88W8997 SDIO firmware and lot of times I get
> > advice about ex-Marvell/NXP wifi cards "do not touch and run away
> > quickly".
> 
> Yeah, then I'm probably biased since I'm mostly dealing with Surface
> stuff.
> 
> > I think that more people if they get mPCIe/M.2 wifi card in laptop which
> > does not work, they just replace it with some working one. And not
> > spending infinite time in trying to fix it... So this may explain why
> > there are more Surface users with these issues...
> 
> That might be an explanation. If it wouldn't need a heat-gun to open it
> up, I'd probably have done that at some point in the past (there were
> times when WiFi at my Uni was pretty much unusable with this device...
> and I'm still not sure what fixed that or even if it's fixed completely).
> 
> > > I'm honestly a bit surprised that MS stuck with them for this
> > > long (they decided to go with Intel for 7th gen devices). AFAICT they
> > > initially chose Marvell due to connected standby support, so maybe that
> > > causes issue for us and others simply aren't using that feature? Just
> > > guessing though.
> > 
> > In my opinion that "Connected Standby" is just MS marketing term.
> 
> I can only really repeat what I've been told: Apparently when they
> started designing those devices, the only option with "Connected
> standby" (or probably rather that feature set that MS wanted) was,
> unfortunately for us, Marvell.
> 
> > 88W[89]xxx chips using full-mac firmware and drivers [*]. Full-mac lot
> > of times causing more issues than soft-mac solution. Moreover this
> > Marvell firmware implements also other "application" layers in firmware
> > which OS drivers can use, e.g. there is fully working "wpa-supplicant"
> > replacement and also AP part. Maybe windows drivers are using it and it
> > cause less problems? Duno. mwifiex uses only "low level" commands and
> > WPA state machine is implemented in userspace wpa-supplicant daemon.
> > 
> > [*] - Small note: There are also soft-mac firmwares and drivers but
> > apparently Marvell has never finished linux driver and firmware was not
> > released to public...
> > 
> > And there is also Laird Connectivity which offers their own proprietary
> > linux kernel drivers with their own firmware for these 88W[89]xxx chips.
> > Last time I checked it they released some parts of driver on github.
> > Maybe somebody could contact Laird or check if their driver can be
> > replaced by mwifiex? Or just replacing ex-Marvell/NXP firmware by their?
> > But I'm not sure if they have something for 88W8897.
> 
> Interesting, I was not aware of this. IIRC we've been experimenting with
> the mwlwifi driver (which that lrdmwl driver seems to be based on?), but
> couldn't get that to work with the firmware we have.

mwlwifi is that soft-mac driver and uses completely different firmware.
For sure it would not work with current full-mac firmware.

> IIRC it also didn't
> work with the Windows firmware (which seems to be significantly
> different from the one we have for Linux and seems to use or be modeled
> after some special Windows WiFi driver interface).

So... Microsoft has different firmware for this chip? And it is working
with mwifiex driver?
Maximilian Luz July 10, 2021, 12:18 a.m. UTC | #18
On 7/10/21 2:07 AM, Pali Rohár wrote:

[...]

>> Interesting, I was not aware of this. IIRC we've been experimenting with
>> the mwlwifi driver (which that lrdmwl driver seems to be based on?), but
>> couldn't get that to work with the firmware we have.
> 
> mwlwifi is that soft-mac driver and uses completely different firmware.
> For sure it would not work with current full-mac firmware.
> 
>> IIRC it also didn't
>> work with the Windows firmware (which seems to be significantly
>> different from the one we have for Linux and seems to use or be modeled
>> after some special Windows WiFi driver interface).
> 
> So... Microsoft has different firmware for this chip? And it is working
> with mwifiex driver?

I'm not sure how special that firmware really is (i.e. if it is Surface
specific or just what Marvell uses on Windows), only that it doesn't
look like the firmware included in the linux-firmware repo. The Windows
firmware doesn't work with either mwlwifi or mwifiex drivers (IIRC) and
on Linux we use the official firmware from the linux-firmware repo.
Pali Rohár July 10, 2021, 12:38 a.m. UTC | #19
On Saturday 10 July 2021 02:18:12 Maximilian Luz wrote:
> On 7/10/21 2:07 AM, Pali Rohár wrote:
> 
> [...]
> 
> > > Interesting, I was not aware of this. IIRC we've been experimenting with
> > > the mwlwifi driver (which that lrdmwl driver seems to be based on?), but
> > > couldn't get that to work with the firmware we have.
> > 
> > mwlwifi is that soft-mac driver and uses completely different firmware.
> > For sure it would not work with current full-mac firmware.
> > 
> > > IIRC it also didn't
> > > work with the Windows firmware (which seems to be significantly
> > > different from the one we have for Linux and seems to use or be modeled
> > > after some special Windows WiFi driver interface).
> > 
> > So... Microsoft has different firmware for this chip? And it is working
> > with mwifiex driver?
> 
> I'm not sure how special that firmware really is (i.e. if it is Surface
> specific or just what Marvell uses on Windows), only that it doesn't
> look like the firmware included in the linux-firmware repo. The Windows
> firmware doesn't work with either mwlwifi or mwifiex drivers (IIRC) and
> on Linux we use the official firmware from the linux-firmware repo.

Version available in the linux-firmware repo is also what big companies
(like google) receive for their systems... sometimes just only older
version as Marvell/NXP is slow in updating files in linux-firmware.
Seems that it is also same what receive customers under NDA as more
companies dropped "proprietary" ex-Marvell/NXP driver on internet and it
contained this firmware with some sources of driver which looks like a
fork of mwifiex (or maybe mwifiex is "cleaned fork" of that driver :D)

There is old firmware documentation which describe RPC communication
between OS and firmware:
http://wiki.laptop.org/images/f/f3/Firmware-Spec-v5.1-MV-S103752-00.pdf

It is really old for very old wifi chips and when I checked it, it still
matches what mwifiex is doing with new chips. Just there are new and
more commands. And documentation is OS-neutral.

So if Microsoft has some "incompatible" firmware with this, it could
mean that they got something special which nobody else have? Maybe it
can explain that "connected standby" and maybe also better stability?

Or just windows distribute firmware in different format and needs to
"unpack" or "preprocess" prior downloading it to device?
Maximilian Luz July 10, 2021, 1:07 a.m. UTC | #20
On 7/10/21 2:38 AM, Pali Rohár wrote:
> On Saturday 10 July 2021 02:18:12 Maximilian Luz wrote:
>> On 7/10/21 2:07 AM, Pali Rohár wrote:
>>
>> [...]
>>
>>>> Interesting, I was not aware of this. IIRC we've been experimenting with
>>>> the mwlwifi driver (which that lrdmwl driver seems to be based on?), but
>>>> couldn't get that to work with the firmware we have.
>>>
>>> mwlwifi is that soft-mac driver and uses completely different firmware.
>>> For sure it would not work with current full-mac firmware.
>>>
>>>> IIRC it also didn't
>>>> work with the Windows firmware (which seems to be significantly
>>>> different from the one we have for Linux and seems to use or be modeled
>>>> after some special Windows WiFi driver interface).
>>>
>>> So... Microsoft has different firmware for this chip? And it is working
>>> with mwifiex driver?
>>
>> I'm not sure how special that firmware really is (i.e. if it is Surface
>> specific or just what Marvell uses on Windows), only that it doesn't
>> look like the firmware included in the linux-firmware repo. The Windows
>> firmware doesn't work with either mwlwifi or mwifiex drivers (IIRC) and
>> on Linux we use the official firmware from the linux-firmware repo.
> 
> Version available in the linux-firmware repo is also what big companies
> (like google) receive for their systems... sometimes just only older
> version as Marvell/NXP is slow in updating files in linux-firmware.
> Seems that it is also same what receive customers under NDA as more
> companies dropped "proprietary" ex-Marvell/NXP driver on internet and it
> contained this firmware with some sources of driver which looks like a
> fork of mwifiex (or maybe mwifiex is "cleaned fork" of that driver :D)
> 
> There is old firmware documentation which describe RPC communication
> between OS and firmware:
> http://wiki.laptop.org/images/f/f3/Firmware-Spec-v5.1-MV-S103752-00.pdf
> 
> It is really old for very old wifi chips and when I checked it, it still
> matches what mwifiex is doing with new chips. Just there are new and
> more commands. And documentation is OS-neutral.
> 
> So if Microsoft has some "incompatible" firmware with this, it could
> mean that they got something special which nobody else have? Maybe it
> can explain that "connected standby" and maybe also better stability?
> 
> Or just windows distribute firmware in different format and needs to
> "unpack" or "preprocess" prior downloading it to device?

If memory serves me right, Jonas did some reverse engineering on the
Windows driver and found that it uses the "new" WDI Miniport API: It
seems that originally both Windows and Linux drivers (and firmware)
were pretty much the same (he mentioned there were similarities in
terminology), but then they switched to that new API on Windows and
changed the firmware with it, so that the driver now essentially only
forwards the commands from that API to the firmware and the firmware
handles the rest.

By reading the Windows docs on that API, that change might have been
forced on them as some Windows 10 features apparently only work via
that API.

He'll probably know more about that than I do.
Jonas Dreßler July 11, 2021, 4:31 p.m. UTC | #21
On 7/9/21 7:30 PM, Pali Rohár wrote:
> On Friday 09 July 2021 19:03:37 Maximilian Luz wrote:
>> On 7/9/21 6:12 PM, Pali Rohár wrote:
>>
>> [...]
>>
>>>>> Hello! Now I'm thinking loudly about this patch. Why this kind of reset
>>>>> is needed only for Surface devices? AFAIK these 88W8897 chips are same
>>>>> in all cards. Chip itself implements PCIe interface (and also SDIO) so
>>>>> for me looks very strange if this 88W8897 PCIe device needs DMI specific
>>>>> quirks. I cannot believe that Microsoft got some special version of
>>>>> these chips from Marvell which are different than version uses on cards
>>>>> in mPCIe form factor.
>>>>>
>>>>> And now when I'm reading comment below about PCIe bridge to which is
>>>>> this 88W8897 PCIe chip connected, is not this rather an issue in that
>>>>> PCIe bridge (instead of mwifiex/88W8897) or in ACPI firmware which
>>>>> controls this bridge?
>>>>>
>>>>> Or are having other people same issues on mPCIe form factor wifi cards
>>>>> with 88W8897 chips and then this quirk should not DMI dependent?
>>>>>
>>>>> Note that I'm seeing issues with reset and other things also on chip
>>>>> 88W8997 when is connected to system via SDIO. These chips have both PCIe
>>>>> and SDIO buses, it just depends which pins are used.
>>>>>
>>>>
>>>> Hi and thanks for the quick reply! Honestly I've no idea, this is just the
>>>> first method we found that allows for a proper reset of the chip. What I
>>>> know is that some Surface devices need that ACPI DSM call (the one that was
>>>> done in the commit I dropped in this version of the patchset) to reset the
>>>> chip instead of this method.
>>>>
>>>> Afaik other devices with this chip don't need this resetting method, at
>>>> least Marvell employees couldn't reproduce the issues on their testing
>>>> devices.
>>>>
>>>> So would you suggest we just try to match for the pci chip 88W8897 instead?
>>>
>>> Hello! Such suggestion makes sense when we know that it is 88W8897
>>> issue. But if you got information that issue cannot be reproduced on
>>> other 88W8897 cards then matching 88W8897 is not correct.
>>>
>>>   From all this information looks like that it is problem in (Microsoft?)
>>> PCIe bridge to which is card connected. Otherwise I do not reason how it
>>> can be 88W8897 affected. Either it is reproducible on 88W8897 cards also
>>> in other devices or issue is not on 88W8897 card.
>>
>> I doubt that it's an issue with the PCIe bridge (itself at least). The
>> same type of bridge is used for both dGPU and NVME SSD on my device (see
>> lspci output below) and those work fine. Also if I'm seeing that right
>> it's from the Intel CPU, so my guess is that a lot more people would
>> have issues with that then.
> 
>  From information below it seems to be related to surprise removal.
> Therefore is surprise removal working without issue for dGPU or NVME
> SSD? Not all PCIe bridges support surprise removal...
> 
>>
>> I don't know about the hardware side, so it might be possible that it's
>> an issue with integrating both bridge and wifi chip, in which case it's
>> still probably best handled via DMI quirks unless we know more.
>>
>> Also as Tsuchiya mentioned in his original submission, on Windows the
>> device is reset via this D3cold method. I've only skimmed that
>> errata.inf file mentioned, but I think this is what he's referring to:
>>
>>    Controls whether ACPIDeviceEnableD3ColdOnSurpriseRemoval rule will be
>>    evaluated or not on a given platform. Currently
>>    ACPIDeviceEnableD3ColdOnSurpriseRemoval rule only needs to be
>>    evaluated on Surface platforms which contain the Marvell WiFi
>>    controller which depends on device going through D3Cold as part of
>>    surprise-removal.
>>
>> and
>>
>>    Starting with Windows releases *after* Blue, ACPI will not put
>>    surprise-removed devices into D3Cold automatically. Some known
>>    scenarios (viz. WiFi reset/recovery) rely on the device cycling
>>    through D3Cold on surprise-removal. This hack allows surprise-removed
>>    devices to be put into D3Cold (if supported by the stack).
>>
>> So, as far as I can tell, the chip doesn't like to be surprise-removed
>> (which seems to happen during reset) and then needs to be power-cycled,
>> which I think is likely due to some issue with firmware state.
> 
> Thanks for information. This really does not look like PCIe bridge
> specific if bridge itself can handle surprise-removed devices. lspci can
> tell us if bridge supports it or not (see below).
> 
>> So the quirk on Windows seems very Surface specific.
>>
>> There also seem a bunch of revisions of these chips around, for example
>> my SB2 is affected by a bug that we've tied to the specific hardware
>> revision which causes some issues with host-sleep (IIRC chip switches
>> rapidly between wake and sleep states without any external influence,
>> which is not how it should behave and how it does behave on a later
>> hardware revision).
> 
> Interesting... This looks like the issue can be in 88W8897 chip and
> needs some special conditions to trigger? And Surface is triggering it
> always?

The specific issue was that the card wakes up very quickly after going 
into deep sleep (deep sleep is a state the firmware enters when idle and 
not connected to an AP). Now this in turn messed with the host 
suspending, because deep sleep is involved there and the card is not 
expected to wake up that quickly again (I'm oversimplifying here, it's 
also been some time since we looked into it).

Anyway, in the end those wakeups from deep sleep only happened with one 
hardware revision of the card (I guess it's caused by a hardware design 
issue like a floating gpio or something), and we managed to fix the 
problems by disabling deep sleep on that hardware revision, but as Max 
mentioned that problem is completely unrelated from this patch.

> 
>>>> Then we'd probably have to check if there are any laptops where multiple
>>>> devices are connected to the pci bridge as Amey suggested in a review
>>>> before.
>>>
>>> Well, I do not know... But if this is issue with PCIe bridge then
>>> similar issue could be observed also for other PCIe devices with this
>>> PCIe bridge. But question is if there are other laptops with this PCIe
>>> bridge. And also it can be a problem in ACPI firmware on those Surface
>>> devices, which implements some PCIe bridge functionality. So it is
>>> possible that issue is with PCIe bridge, not in HW, but in SW/firmware
>>> part which can be Microsoft specific... So too many questions to which
>>> we do not know answers.
>>>
>>> Could you provide output of 'lspci -nn -vv' and 'lspci -tvnn' on
>>> affected machines? If you have already sent it in some previous email,
>>> just send a link. At least I'm not able to find it right now and output
>>> may contain something useful...
>>
>>  From my Surface Book 2 (with the same issue):
>>
>>   - lspci -tvnn: https://paste.ubuntu.com/p/mm3YpcZJ8N/
>>   - lspci -vv -nn: https://paste.ubuntu.com/p/dctTDP738N/
> 
> Could you re-run lspci under root account? There are missing important
> parts like "Capabilities: <access denied>" where is information if
> bridge supports surprise removal or not.
> 
>> Regards,
>> Max
Jonas Dreßler July 11, 2021, 4:53 p.m. UTC | #22
On 7/10/21 3:07 AM, Maximilian Luz wrote:
> On 7/10/21 2:38 AM, Pali Rohár wrote:
>> On Saturday 10 July 2021 02:18:12 Maximilian Luz wrote:
>>> On 7/10/21 2:07 AM, Pali Rohár wrote:
>>>
>>> [...]
>>>
>>>>> Interesting, I was not aware of this. IIRC we've been experimenting 
>>>>> with
>>>>> the mwlwifi driver (which that lrdmwl driver seems to be based 
>>>>> on?), but
>>>>> couldn't get that to work with the firmware we have.
>>>>
>>>> mwlwifi is that soft-mac driver and uses completely different firmware.
>>>> For sure it would not work with current full-mac firmware.
>>>>
>>>>> IIRC it also didn't
>>>>> work with the Windows firmware (which seems to be significantly
>>>>> different from the one we have for Linux and seems to use or be 
>>>>> modeled
>>>>> after some special Windows WiFi driver interface).
>>>>
>>>> So... Microsoft has different firmware for this chip? And it is working
>>>> with mwifiex driver?
>>>
>>> I'm not sure how special that firmware really is (i.e. if it is Surface
>>> specific or just what Marvell uses on Windows), only that it doesn't
>>> look like the firmware included in the linux-firmware repo. The Windows
>>> firmware doesn't work with either mwlwifi or mwifiex drivers (IIRC) and
>>> on Linux we use the official firmware from the linux-firmware repo.
>>
>> Version available in the linux-firmware repo is also what big companies
>> (like google) receive for their systems... sometimes just only older
>> version as Marvell/NXP is slow in updating files in linux-firmware.
>> Seems that it is also same what receive customers under NDA as more
>> companies dropped "proprietary" ex-Marvell/NXP driver on internet and it
>> contained this firmware with some sources of driver which looks like a
>> fork of mwifiex (or maybe mwifiex is "cleaned fork" of that driver :D)
>>
>> There is old firmware documentation which describe RPC communication
>> between OS and firmware:
>> http://wiki.laptop.org/images/f/f3/Firmware-Spec-v5.1-MV-S103752-00.pdf
>>
>> It is really old for very old wifi chips and when I checked it, it still
>> matches what mwifiex is doing with new chips. Just there are new and
>> more commands. And documentation is OS-neutral.
>>
>> So if Microsoft has some "incompatible" firmware with this, it could
>> mean that they got something special which nobody else have? Maybe it
>> can explain that "connected standby" and maybe also better stability?
>>
>> Or just windows distribute firmware in different format and needs to
>> "unpack" or "preprocess" prior downloading it to device?
> 
> If memory serves me right, Jonas did some reverse engineering on the
> Windows driver and found that it uses the "new" WDI Miniport API: It
> seems that originally both Windows and Linux drivers (and firmware)
> were pretty much the same (he mentioned there were similarities in
> terminology), but then they switched to that new API on Windows and
> changed the firmware with it, so that the driver now essentially only
> forwards the commands from that API to the firmware and the firmware
> handles the rest.
> 
> By reading the Windows docs on that API, that change might have been
> forced on them as some Windows 10 features apparently only work via
> that API.
> 
> He'll probably know more about that than I do.

Not much I can add there, it seemed a lot like both mwifiex and the 
Windows 10 WDI miniport driver were both derived from the same codebase 
originally, but in order to be compatible with the WDI miniport API and 
other stuff Windows requires from wifi devices (I recall there was some 
SAR-value control/reporting stuff too), some parts of the firmware had 
to be rewritten.

In the end, the Windows firmware is updated a lot more often and likely 
includes a bunch of bugfixes the linux firmware doesn't have, but it 
can't be used on linux without a ton of work that would probably include 
rebuilding proprietary APIs from Windows.

Also, from my testing with custom APs and sniffing packets with 
Wireshark, the functionality, limitations and weird 
"semi-spec-compliant" behaviors were exactly the same with the Windows 
firmware: It doesn't support WPA3, it can't connect to fast transition 
APs (funnily enough that's opposed to what MS claims) and it also can't 
spawn an AP with WPA-PSK-SHA256 AKM ciphers. So not sure there's a lot 
of sense in spending more time trying to go down that path.
Pali Rohár July 11, 2021, 5:01 p.m. UTC | #23
On Sunday 11 July 2021 18:53:32 Jonas Dreßler wrote:
> On 7/10/21 3:07 AM, Maximilian Luz wrote:
> > On 7/10/21 2:38 AM, Pali Rohár wrote:
> > > On Saturday 10 July 2021 02:18:12 Maximilian Luz wrote:
> > > > On 7/10/21 2:07 AM, Pali Rohár wrote:
> > > > 
> > > > [...]
> > > > 
> > > > > > Interesting, I was not aware of this. IIRC we've been
> > > > > > experimenting with
> > > > > > the mwlwifi driver (which that lrdmwl driver seems to be
> > > > > > based on?), but
> > > > > > couldn't get that to work with the firmware we have.
> > > > > 
> > > > > mwlwifi is that soft-mac driver and uses completely different firmware.
> > > > > For sure it would not work with current full-mac firmware.
> > > > > 
> > > > > > IIRC it also didn't
> > > > > > work with the Windows firmware (which seems to be significantly
> > > > > > different from the one we have for Linux and seems to
> > > > > > use or be modeled
> > > > > > after some special Windows WiFi driver interface).
> > > > > 
> > > > > So... Microsoft has different firmware for this chip? And it is working
> > > > > with mwifiex driver?
> > > > 
> > > > I'm not sure how special that firmware really is (i.e. if it is Surface
> > > > specific or just what Marvell uses on Windows), only that it doesn't
> > > > look like the firmware included in the linux-firmware repo. The Windows
> > > > firmware doesn't work with either mwlwifi or mwifiex drivers (IIRC) and
> > > > on Linux we use the official firmware from the linux-firmware repo.
> > > 
> > > Version available in the linux-firmware repo is also what big companies
> > > (like google) receive for their systems... sometimes just only older
> > > version as Marvell/NXP is slow in updating files in linux-firmware.
> > > Seems that it is also same what receive customers under NDA as more
> > > companies dropped "proprietary" ex-Marvell/NXP driver on internet and it
> > > contained this firmware with some sources of driver which looks like a
> > > fork of mwifiex (or maybe mwifiex is "cleaned fork" of that driver :D)
> > > 
> > > There is old firmware documentation which describe RPC communication
> > > between OS and firmware:
> > > http://wiki.laptop.org/images/f/f3/Firmware-Spec-v5.1-MV-S103752-00.pdf
> > > 
> > > It is really old for very old wifi chips and when I checked it, it still
> > > matches what mwifiex is doing with new chips. Just there are new and
> > > more commands. And documentation is OS-neutral.
> > > 
> > > So if Microsoft has some "incompatible" firmware with this, it could
> > > mean that they got something special which nobody else have? Maybe it
> > > can explain that "connected standby" and maybe also better stability?
> > > 
> > > Or just windows distribute firmware in different format and needs to
> > > "unpack" or "preprocess" prior downloading it to device?
> > 
> > If memory serves me right, Jonas did some reverse engineering on the
> > Windows driver and found that it uses the "new" WDI Miniport API: It
> > seems that originally both Windows and Linux drivers (and firmware)
> > were pretty much the same (he mentioned there were similarities in
> > terminology), but then they switched to that new API on Windows and
> > changed the firmware with it, so that the driver now essentially only
> > forwards the commands from that API to the firmware and the firmware
> > handles the rest.
> > 
> > By reading the Windows docs on that API, that change might have been
> > forced on them as some Windows 10 features apparently only work via
> > that API.
> > 
> > He'll probably know more about that than I do.
> 
> Not much I can add there, it seemed a lot like both mwifiex and the Windows
> 10 WDI miniport driver were both derived from the same codebase originally,
> but in order to be compatible with the WDI miniport API and other stuff
> Windows requires from wifi devices (I recall there was some SAR-value
> control/reporting stuff too), some parts of the firmware had to be
> rewritten.
> 
> In the end, the Windows firmware is updated a lot more often and likely
> includes a bunch of bugfixes the linux firmware doesn't have, but it can't
> be used on linux without a ton of work that would probably include
> rebuilding proprietary APIs from Windows.
> 
> Also, from my testing with custom APs and sniffing packets with Wireshark,
> the functionality, limitations and weird "semi-spec-compliant" behaviors
> were exactly the same with the Windows firmware: It doesn't support WPA3, it
> can't connect to fast transition APs (funnily enough that's opposed to what
> MS claims) and it also can't spawn an AP with WPA-PSK-SHA256 AKM ciphers. So
> not sure there's a lot of sense in spending more time trying to go down that
> path.

New version of firmware files are available on NXP portal, where are
updated more frequently, but only for companies which have NXP accounts
and signed NDA with NXP. Not for end users.

If you want these new firmware files, you need to ask NXP developers as
only they can ask for non-NDA distribution and include new version into
linux-firmware repository. Like in this pull request where is new SDIO
firmware for 88W8897:
https://lore.kernel.org/linux-firmware/DB7PR04MB453855B0D6C41923BCB0922EFC1C9@DB7PR04MB4538.eurprd04.prod.outlook.com/
diff mbox series

Patch

diff --git a/drivers/net/wireless/marvell/mwifiex/pcie.c b/drivers/net/wireless/marvell/mwifiex/pcie.c
index a530832c9421..c6ccce426b49 100644
--- a/drivers/net/wireless/marvell/mwifiex/pcie.c
+++ b/drivers/net/wireless/marvell/mwifiex/pcie.c
@@ -528,6 +528,13 @@  static void mwifiex_pcie_reset_prepare(struct pci_dev *pdev)
 	mwifiex_shutdown_sw(adapter);
 	clear_bit(MWIFIEX_IFACE_WORK_DEVICE_DUMP, &card->work_flags);
 	clear_bit(MWIFIEX_IFACE_WORK_CARD_RESET, &card->work_flags);
+
+	/* On MS Surface gen4+ devices FLR isn't effective to recover from
+	 * hangups, so we power-cycle the card instead.
+	 */
+	if (card->quirks & QUIRK_FW_RST_D3COLD)
+		mwifiex_pcie_reset_d3cold_quirk(pdev);
+
 	mwifiex_dbg(adapter, INFO, "%s, successful\n", __func__);
 
 	card->pci_reset_ongoing = true;
diff --git a/drivers/net/wireless/marvell/mwifiex/pcie_quirks.c b/drivers/net/wireless/marvell/mwifiex/pcie_quirks.c
index 4064f99b36ba..b5f214fc1212 100644
--- a/drivers/net/wireless/marvell/mwifiex/pcie_quirks.c
+++ b/drivers/net/wireless/marvell/mwifiex/pcie_quirks.c
@@ -15,6 +15,72 @@ 
 
 /* quirk table based on DMI matching */
 static const struct dmi_system_id mwifiex_quirk_table[] = {
+	{
+		.ident = "Surface Pro 4",
+		.matches = {
+			DMI_EXACT_MATCH(DMI_SYS_VENDOR, "Microsoft Corporation"),
+			DMI_EXACT_MATCH(DMI_PRODUCT_NAME, "Surface Pro 4"),
+		},
+		.driver_data = (void *)QUIRK_FW_RST_D3COLD,
+	},
+	{
+		.ident = "Surface Pro 5",
+		.matches = {
+			/* match for SKU here due to generic product name "Surface Pro" */
+			DMI_EXACT_MATCH(DMI_SYS_VENDOR, "Microsoft Corporation"),
+			DMI_EXACT_MATCH(DMI_PRODUCT_SKU, "Surface_Pro_1796"),
+		},
+		.driver_data = (void *)QUIRK_FW_RST_D3COLD,
+	},
+	{
+		.ident = "Surface Pro 5 (LTE)",
+		.matches = {
+			/* match for SKU here due to generic product name "Surface Pro" */
+			DMI_EXACT_MATCH(DMI_SYS_VENDOR, "Microsoft Corporation"),
+			DMI_EXACT_MATCH(DMI_PRODUCT_SKU, "Surface_Pro_1807"),
+		},
+		.driver_data = (void *)QUIRK_FW_RST_D3COLD,
+	},
+	{
+		.ident = "Surface Pro 6",
+		.matches = {
+			DMI_EXACT_MATCH(DMI_SYS_VENDOR, "Microsoft Corporation"),
+			DMI_EXACT_MATCH(DMI_PRODUCT_NAME, "Surface Pro 6"),
+		},
+		.driver_data = (void *)QUIRK_FW_RST_D3COLD,
+	},
+	{
+		.ident = "Surface Book 1",
+		.matches = {
+			DMI_EXACT_MATCH(DMI_SYS_VENDOR, "Microsoft Corporation"),
+			DMI_EXACT_MATCH(DMI_PRODUCT_NAME, "Surface Book"),
+		},
+		.driver_data = (void *)QUIRK_FW_RST_D3COLD,
+	},
+	{
+		.ident = "Surface Book 2",
+		.matches = {
+			DMI_EXACT_MATCH(DMI_SYS_VENDOR, "Microsoft Corporation"),
+			DMI_EXACT_MATCH(DMI_PRODUCT_NAME, "Surface Book 2"),
+		},
+		.driver_data = (void *)QUIRK_FW_RST_D3COLD,
+	},
+	{
+		.ident = "Surface Laptop 1",
+		.matches = {
+			DMI_EXACT_MATCH(DMI_SYS_VENDOR, "Microsoft Corporation"),
+			DMI_EXACT_MATCH(DMI_PRODUCT_NAME, "Surface Laptop"),
+		},
+		.driver_data = (void *)QUIRK_FW_RST_D3COLD,
+	},
+	{
+		.ident = "Surface Laptop 2",
+		.matches = {
+			DMI_EXACT_MATCH(DMI_SYS_VENDOR, "Microsoft Corporation"),
+			DMI_EXACT_MATCH(DMI_PRODUCT_NAME, "Surface Laptop 2"),
+		},
+		.driver_data = (void *)QUIRK_FW_RST_D3COLD,
+	},
 	{}
 };
 
@@ -29,4 +95,61 @@  void mwifiex_initialize_quirks(struct pcie_service_card *card)
 
 	if (!card->quirks)
 		dev_info(&pdev->dev, "no quirks enabled\n");
+	if (card->quirks & QUIRK_FW_RST_D3COLD)
+		dev_info(&pdev->dev, "quirk reset_d3cold enabled\n");
+}
+
+static void mwifiex_pcie_set_power_d3cold(struct pci_dev *pdev)
+{
+	dev_info(&pdev->dev, "putting into D3cold...\n");
+
+	pci_save_state(pdev);
+	if (pci_is_enabled(pdev))
+		pci_disable_device(pdev);
+	pci_set_power_state(pdev, PCI_D3cold);
+}
+
+static int mwifiex_pcie_set_power_d0(struct pci_dev *pdev)
+{
+	int ret;
+
+	dev_info(&pdev->dev, "putting into D0...\n");
+
+	pci_set_power_state(pdev, PCI_D0);
+	ret = pci_enable_device(pdev);
+	if (ret) {
+		dev_err(&pdev->dev, "pci_enable_device failed\n");
+		return ret;
+	}
+	pci_restore_state(pdev);
+
+	return 0;
+}
+
+int mwifiex_pcie_reset_d3cold_quirk(struct pci_dev *pdev)
+{
+	struct pci_dev *parent_pdev = pci_upstream_bridge(pdev);
+	int ret;
+
+	/* Power-cycle (put into D3cold then D0) */
+	dev_info(&pdev->dev, "Using reset_d3cold quirk to perform FW reset\n");
+
+	/* We need to perform power-cycle also for bridge of wifi because
+	 * on some devices (e.g. Surface Book 1), the OS for some reasons
+	 * can't know the real power state of the bridge.
+	 * When tried to power-cycle only wifi, the reset failed with the
+	 * following dmesg log:
+	 * "Cannot transition to power state D0 for parent in D3hot".
+	 */
+	mwifiex_pcie_set_power_d3cold(pdev);
+	mwifiex_pcie_set_power_d3cold(parent_pdev);
+
+	ret = mwifiex_pcie_set_power_d0(parent_pdev);
+	if (ret)
+		return ret;
+	ret = mwifiex_pcie_set_power_d0(pdev);
+	if (ret)
+		return ret;
+
+	return 0;
 }
diff --git a/drivers/net/wireless/marvell/mwifiex/pcie_quirks.h b/drivers/net/wireless/marvell/mwifiex/pcie_quirks.h
index 7a1fe3b3a61a..549093067813 100644
--- a/drivers/net/wireless/marvell/mwifiex/pcie_quirks.h
+++ b/drivers/net/wireless/marvell/mwifiex/pcie_quirks.h
@@ -5,4 +5,7 @@ 
 
 #include "pcie.h"
 
+#define QUIRK_FW_RST_D3COLD	BIT(0)
+
 void mwifiex_initialize_quirks(struct pcie_service_card *card);
+int mwifiex_pcie_reset_d3cold_quirk(struct pci_dev *pdev);