diff mbox

arm64: dts: rockchip: Fix rk3399-gru-* s2r (pinctrl hogs, wifi reset)

Message ID 20180227204711.29357-1-dianders@chromium.org (mailing list archive)
State New, archived
Headers show

Commit Message

Doug Anderson Feb. 27, 2018, 8:47 p.m. UTC
Back in the early days when gru devices were still under development
we found an issue where the WiFi reset line needed to be configured as
early as possible during the boot process to avoid the WiFi module
being in a bad state.

We found that the way to get the kernel to do this in the earliest
possible place was to configure this line in the pinctrl hogs, so
that's what we did.  For some history here you can see
<http://crosreview.com/368770>.  After the time that change landed in
the kernel, we landed a firmware change to configure this line even
earlier.  See <http://crosreview.com/399919>.  However, even after the
firmware change landed we kept the kernel change to deal with the fact
that some people working on devices might take a little while to
update their firmware.

At this there are definitely zero devices out in the wild that have
firmware without the fix in it.  Specifically looking in the firmware
branch several critically important fixes for memory stability landed
after the patch in coreboot and I know we didn't ship without those.
Thus, by now, everyone should have the new firmware and it's safe to
not have the kernel set this up in a pinctrl hog.

Historically, even though it wasn't needed to have this in a pinctrl
hog, we still kept it since it didn't hurt.  Pinctrl would apply the
default hog at bootup and then would never touch things again.  That
all changed with commit 981ed1bfbc6c ("pinctrl: Really force states
during suspend/resume").  After that commit then we'll re-apply the
default hog at resume time and that can screw up the reset state of
WiFi.  ...and on rk3399 if you touch a device on PCIe in the wrong way
then the whole system can go haywire.  That's what was happening.
Specifically you'd resume a rk3399-gru-* device and it would mostly
resume, then would crash with some crazy weird crash.

One could say, perhaps, that the recent pinctrl change was at fault
(and should be fixed) since it changed behavior.  ...but that's not
really true.  The device tree for rk3399-gru is really to blame.
Specifically since the pinctrl is defined in the hog and not in the
"wlan-pd-n" node then the actual user of this pin doesn't have a
pinctrl entry for it.  That's bad.

Let's fix our problems by just moving the control of
"wlan_module_reset_l pinctrl" out of the hog and put them in the
proper place.

NOTE: in theory, I think it should actually be possible to have a pin
controlled _both_ by the hog and by an actual device.  Once the device
claims the pin I think the hog is supposed to let go.  I'm not 100%
sure that this works and in any case this solution would be more
complex than is necessary.

Reported-by: Marc Zyngier <marc.zyngier@arm.com>
Fixes: 48f4d9796d99 ("arm64: dts: rockchip: add Gru/Kevin DTS")
Fixes: 981ed1bfbc6c ("pinctrl: Really force states during suspend/resume")
Signed-off-by: Douglas Anderson <dianders@chromium.org>
---

 arch/arm64/boot/dts/rockchip/rk3399-gru.dtsi | 16 +++-------------
 1 file changed, 3 insertions(+), 13 deletions(-)

Comments

Enric Balletbo i Serra Feb. 28, 2018, 12:19 p.m. UTC | #1
Hi Doug,

On 27/02/18 21:47, Douglas Anderson wrote:
> Back in the early days when gru devices were still under development
> we found an issue where the WiFi reset line needed to be configured as
> early as possible during the boot process to avoid the WiFi module
> being in a bad state.
> 
> We found that the way to get the kernel to do this in the earliest
> possible place was to configure this line in the pinctrl hogs, so
> that's what we did.  For some history here you can see
> <http://crosreview.com/368770>.  After the time that change landed in
> the kernel, we landed a firmware change to configure this line even
> earlier.  See <http://crosreview.com/399919>.  However, even after the
> firmware change landed we kept the kernel change to deal with the fact
> that some people working on devices might take a little while to
> update their firmware.
> 
> At this there are definitely zero devices out in the wild that have
> firmware without the fix in it.  Specifically looking in the firmware
> branch several critically important fixes for memory stability landed
> after the patch in coreboot and I know we didn't ship without those.
> Thus, by now, everyone should have the new firmware and it's safe to
> not have the kernel set this up in a pinctrl hog.
> 
> Historically, even though it wasn't needed to have this in a pinctrl
> hog, we still kept it since it didn't hurt.  Pinctrl would apply the
> default hog at bootup and then would never touch things again.  That
> all changed with commit 981ed1bfbc6c ("pinctrl: Really force states
> during suspend/resume").  After that commit then we'll re-apply the
> default hog at resume time and that can screw up the reset state of
> WiFi.  ...and on rk3399 if you touch a device on PCIe in the wrong way
> then the whole system can go haywire.  That's what was happening.
> Specifically you'd resume a rk3399-gru-* device and it would mostly
> resume, then would crash with some crazy weird crash.
> 
> One could say, perhaps, that the recent pinctrl change was at fault
> (and should be fixed) since it changed behavior.  ...but that's not
> really true.  The device tree for rk3399-gru is really to blame.
> Specifically since the pinctrl is defined in the hog and not in the
> "wlan-pd-n" node then the actual user of this pin doesn't have a
> pinctrl entry for it.  That's bad.
> 
> Let's fix our problems by just moving the control of
> "wlan_module_reset_l pinctrl" out of the hog and put them in the
> proper place.
> 
> NOTE: in theory, I think it should actually be possible to have a pin
> controlled _both_ by the hog and by an actual device.  Once the device
> claims the pin I think the hog is supposed to let go.  I'm not 100%
> sure that this works and in any case this solution would be more
> complex than is necessary.
> 
> Reported-by: Marc Zyngier <marc.zyngier@arm.com>
> Fixes: 48f4d9796d99 ("arm64: dts: rockchip: add Gru/Kevin DTS")
> Fixes: 981ed1bfbc6c ("pinctrl: Really force states during suspend/resume")
> Signed-off-by: Douglas Anderson <dianders@chromium.org>
> ---
> 
>  arch/arm64/boot/dts/rockchip/rk3399-gru.dtsi | 16 +++-------------
>  1 file changed, 3 insertions(+), 13 deletions(-)
> 
> diff --git a/arch/arm64/boot/dts/rockchip/rk3399-gru.dtsi b/arch/arm64/boot/dts/rockchip/rk3399-gru.dtsi
> index 6e50768a34ce..9ad54751d0d8 100644
> --- a/arch/arm64/boot/dts/rockchip/rk3399-gru.dtsi
> +++ b/arch/arm64/boot/dts/rockchip/rk3399-gru.dtsi
> @@ -406,8 +406,9 @@
>  	wlan_pd_n: wlan-pd-n {
>  		compatible = "regulator-fixed";
>  		regulator-name = "wlan_pd_n";
> +		pinctrl-names = "default";
> +		pinctrl-0 = <&wlan_module_reset_l>;
>  
> -		/* Note the wlan_module_reset_l pinctrl */
>  		enable-active-high;
>  		gpio = <&gpio1 11 GPIO_ACTIVE_HIGH>;
>  
> @@ -988,12 +989,6 @@ ap_i2c_audio: &i2c8 {
>  	pinctrl-0 = <
>  		&ap_pwroff	/* AP will auto-assert this when in S3 */
>  		&clk_32k	/* This pin is always 32k on gru boards */
> -
> -		/*
> -		 * We want this driven low ASAP; firmware should help us, but
> -		 * we can help ourselves too.
> -		 */
> -		&wlan_module_reset_l
>  	>;
>  
>  	pcfg_output_low: pcfg-output-low {
> @@ -1173,12 +1168,7 @@ ap_i2c_audio: &i2c8 {
>  		};
>  
>  		wlan_module_reset_l: wlan-module-reset-l {
> -			/*
> -			 * We want this driven low ASAP (As {Soon,Strongly} As
> -			 * Possible), to avoid leakage through the powered-down
> -			 * WiFi.
> -			 */
> -			rockchip,pins = <1 11 RK_FUNC_GPIO &pcfg_output_low>;
> +			rockchip,pins = <1 11 RK_FUNC_GPIO &pcfg_pull_none>;
>  		};
>  
>  		bt_host_wake_l: bt-host-wake-l {
> 

In linux-next there are still different issues with s2r, but definitely, this
patch has allowed me to enable again the PCIE on my Samsung Chromebook Plus, s2r
goes now a bit further :) So many thanks for the patch.

Tested-by: Enric Balletbo i Serra <enric.balletbo@collabora.com>

Regards,
  Enric
Marc Zyngier Feb. 28, 2018, 1:51 p.m. UTC | #2
On 27/02/18 20:47, Douglas Anderson wrote:
> Back in the early days when gru devices were still under development
> we found an issue where the WiFi reset line needed to be configured as
> early as possible during the boot process to avoid the WiFi module
> being in a bad state.
> 
> We found that the way to get the kernel to do this in the earliest
> possible place was to configure this line in the pinctrl hogs, so
> that's what we did.  For some history here you can see
> <http://crosreview.com/368770>.  After the time that change landed in
> the kernel, we landed a firmware change to configure this line even
> earlier.  See <http://crosreview.com/399919>.  However, even after the
> firmware change landed we kept the kernel change to deal with the fact
> that some people working on devices might take a little while to
> update their firmware.
> 
> At this there are definitely zero devices out in the wild that have
> firmware without the fix in it.  Specifically looking in the firmware
> branch several critically important fixes for memory stability landed
> after the patch in coreboot and I know we didn't ship without those.
> Thus, by now, everyone should have the new firmware and it's safe to
> not have the kernel set this up in a pinctrl hog.
> 
> Historically, even though it wasn't needed to have this in a pinctrl
> hog, we still kept it since it didn't hurt.  Pinctrl would apply the
> default hog at bootup and then would never touch things again.  That
> all changed with commit 981ed1bfbc6c ("pinctrl: Really force states
> during suspend/resume").  After that commit then we'll re-apply the
> default hog at resume time and that can screw up the reset state of
> WiFi.  ...and on rk3399 if you touch a device on PCIe in the wrong way
> then the whole system can go haywire.  That's what was happening.
> Specifically you'd resume a rk3399-gru-* device and it would mostly
> resume, then would crash with some crazy weird crash.
> 
> One could say, perhaps, that the recent pinctrl change was at fault
> (and should be fixed) since it changed behavior.  ...but that's not
> really true.  The device tree for rk3399-gru is really to blame.
> Specifically since the pinctrl is defined in the hog and not in the
> "wlan-pd-n" node then the actual user of this pin doesn't have a
> pinctrl entry for it.  That's bad.
> 
> Let's fix our problems by just moving the control of
> "wlan_module_reset_l pinctrl" out of the hog and put them in the
> proper place.
> 
> NOTE: in theory, I think it should actually be possible to have a pin
> controlled _both_ by the hog and by an actual device.  Once the device
> claims the pin I think the hog is supposed to let go.  I'm not 100%
> sure that this works and in any case this solution would be more
> complex than is necessary.
> 
> Reported-by: Marc Zyngier <marc.zyngier@arm.com>
> Fixes: 48f4d9796d99 ("arm64: dts: rockchip: add Gru/Kevin DTS")
> Fixes: 981ed1bfbc6c ("pinctrl: Really force states during suspend/resume")
> Signed-off-by: Douglas Anderson <dianders@chromium.org>

Tested-by: Marc Zyngier <marc.zyngier@arm.com>

Thanks,

	M.
Marc Zyngier Feb. 28, 2018, 2:01 p.m. UTC | #3
On 28/02/18 12:19, Enric Balletbo i Serra wrote:
> Hi Doug,
> 
> On 27/02/18 21:47, Douglas Anderson wrote:
>> Back in the early days when gru devices were still under development
>> we found an issue where the WiFi reset line needed to be configured as
>> early as possible during the boot process to avoid the WiFi module
>> being in a bad state.
>>
>> We found that the way to get the kernel to do this in the earliest
>> possible place was to configure this line in the pinctrl hogs, so
>> that's what we did.  For some history here you can see
>> <http://crosreview.com/368770>.  After the time that change landed in
>> the kernel, we landed a firmware change to configure this line even
>> earlier.  See <http://crosreview.com/399919>.  However, even after the
>> firmware change landed we kept the kernel change to deal with the fact
>> that some people working on devices might take a little while to
>> update their firmware.
>>
>> At this there are definitely zero devices out in the wild that have
>> firmware without the fix in it.  Specifically looking in the firmware
>> branch several critically important fixes for memory stability landed
>> after the patch in coreboot and I know we didn't ship without those.
>> Thus, by now, everyone should have the new firmware and it's safe to
>> not have the kernel set this up in a pinctrl hog.
>>
>> Historically, even though it wasn't needed to have this in a pinctrl
>> hog, we still kept it since it didn't hurt.  Pinctrl would apply the
>> default hog at bootup and then would never touch things again.  That
>> all changed with commit 981ed1bfbc6c ("pinctrl: Really force states
>> during suspend/resume").  After that commit then we'll re-apply the
>> default hog at resume time and that can screw up the reset state of
>> WiFi.  ...and on rk3399 if you touch a device on PCIe in the wrong way
>> then the whole system can go haywire.  That's what was happening.
>> Specifically you'd resume a rk3399-gru-* device and it would mostly
>> resume, then would crash with some crazy weird crash.
>>
>> One could say, perhaps, that the recent pinctrl change was at fault
>> (and should be fixed) since it changed behavior.  ...but that's not
>> really true.  The device tree for rk3399-gru is really to blame.
>> Specifically since the pinctrl is defined in the hog and not in the
>> "wlan-pd-n" node then the actual user of this pin doesn't have a
>> pinctrl entry for it.  That's bad.
>>
>> Let's fix our problems by just moving the control of
>> "wlan_module_reset_l pinctrl" out of the hog and put them in the
>> proper place.
>>
>> NOTE: in theory, I think it should actually be possible to have a pin
>> controlled _both_ by the hog and by an actual device.  Once the device
>> claims the pin I think the hog is supposed to let go.  I'm not 100%
>> sure that this works and in any case this solution would be more
>> complex than is necessary.
>>
>> Reported-by: Marc Zyngier <marc.zyngier@arm.com>
>> Fixes: 48f4d9796d99 ("arm64: dts: rockchip: add Gru/Kevin DTS")
>> Fixes: 981ed1bfbc6c ("pinctrl: Really force states during suspend/resume")
>> Signed-off-by: Douglas Anderson <dianders@chromium.org>
>> ---
>>
>>  arch/arm64/boot/dts/rockchip/rk3399-gru.dtsi | 16 +++-------------
>>  1 file changed, 3 insertions(+), 13 deletions(-)
>>
>> diff --git a/arch/arm64/boot/dts/rockchip/rk3399-gru.dtsi b/arch/arm64/boot/dts/rockchip/rk3399-gru.dtsi
>> index 6e50768a34ce..9ad54751d0d8 100644
>> --- a/arch/arm64/boot/dts/rockchip/rk3399-gru.dtsi
>> +++ b/arch/arm64/boot/dts/rockchip/rk3399-gru.dtsi
>> @@ -406,8 +406,9 @@
>>  	wlan_pd_n: wlan-pd-n {
>>  		compatible = "regulator-fixed";
>>  		regulator-name = "wlan_pd_n";
>> +		pinctrl-names = "default";
>> +		pinctrl-0 = <&wlan_module_reset_l>;
>>  
>> -		/* Note the wlan_module_reset_l pinctrl */
>>  		enable-active-high;
>>  		gpio = <&gpio1 11 GPIO_ACTIVE_HIGH>;
>>  
>> @@ -988,12 +989,6 @@ ap_i2c_audio: &i2c8 {
>>  	pinctrl-0 = <
>>  		&ap_pwroff	/* AP will auto-assert this when in S3 */
>>  		&clk_32k	/* This pin is always 32k on gru boards */
>> -
>> -		/*
>> -		 * We want this driven low ASAP; firmware should help us, but
>> -		 * we can help ourselves too.
>> -		 */
>> -		&wlan_module_reset_l
>>  	>;
>>  
>>  	pcfg_output_low: pcfg-output-low {
>> @@ -1173,12 +1168,7 @@ ap_i2c_audio: &i2c8 {
>>  		};
>>  
>>  		wlan_module_reset_l: wlan-module-reset-l {
>> -			/*
>> -			 * We want this driven low ASAP (As {Soon,Strongly} As
>> -			 * Possible), to avoid leakage through the powered-down
>> -			 * WiFi.
>> -			 */
>> -			rockchip,pins = <1 11 RK_FUNC_GPIO &pcfg_output_low>;
>> +			rockchip,pins = <1 11 RK_FUNC_GPIO &pcfg_pull_none>;
>>  		};
>>  
>>  		bt_host_wake_l: bt-host-wake-l {
>>
> 
> In linux-next there are still different issues with s2r, but definitely, this
> patch has allowed me to enable again the PCIE on my Samsung Chromebook Plus, s2r
> goes now a bit further :) So many thanks for the patch.
> 
> Tested-by: Enric Balletbo i Serra <enric.balletbo@collabora.com>
For the record, I maintain a small branch[1] based on the latest
mainline, containing fixes only, and not dragging the whole of -next
into an already fairly unstable kernel. Most things do work (including
s2r), apart from USB3, external display, and some of the sound stuff.
Crucially, kexec is now functional, allowing the kernel to be used as a
bootloader...

	M.

[1]
https://git.kernel.org/pub/scm/linux/kernel/git/maz/arm-platforms.git/log/?h=hack/kevin-4.16
Heiko Stübner March 1, 2018, 8:43 a.m. UTC | #4
Am Dienstag, 27. Februar 2018, 21:47:11 CET schrieb Douglas Anderson:
> Back in the early days when gru devices were still under development
> we found an issue where the WiFi reset line needed to be configured as
> early as possible during the boot process to avoid the WiFi module
> being in a bad state.
> 
> We found that the way to get the kernel to do this in the earliest
> possible place was to configure this line in the pinctrl hogs, so
> that's what we did.  For some history here you can see
> <http://crosreview.com/368770>.  After the time that change landed in
> the kernel, we landed a firmware change to configure this line even
> earlier.  See <http://crosreview.com/399919>.  However, even after the
> firmware change landed we kept the kernel change to deal with the fact
> that some people working on devices might take a little while to
> update their firmware.
> 
> At this there are definitely zero devices out in the wild that have
> firmware without the fix in it.  Specifically looking in the firmware
> branch several critically important fixes for memory stability landed
> after the patch in coreboot and I know we didn't ship without those.
> Thus, by now, everyone should have the new firmware and it's safe to
> not have the kernel set this up in a pinctrl hog.
> 
> Historically, even though it wasn't needed to have this in a pinctrl
> hog, we still kept it since it didn't hurt.  Pinctrl would apply the
> default hog at bootup and then would never touch things again.  That
> all changed with commit 981ed1bfbc6c ("pinctrl: Really force states
> during suspend/resume").  After that commit then we'll re-apply the
> default hog at resume time and that can screw up the reset state of
> WiFi.  ...and on rk3399 if you touch a device on PCIe in the wrong way
> then the whole system can go haywire.  That's what was happening.
> Specifically you'd resume a rk3399-gru-* device and it would mostly
> resume, then would crash with some crazy weird crash.
> 
> One could say, perhaps, that the recent pinctrl change was at fault
> (and should be fixed) since it changed behavior.  ...but that's not
> really true.  The device tree for rk3399-gru is really to blame.
> Specifically since the pinctrl is defined in the hog and not in the
> "wlan-pd-n" node then the actual user of this pin doesn't have a
> pinctrl entry for it.  That's bad.
> 
> Let's fix our problems by just moving the control of
> "wlan_module_reset_l pinctrl" out of the hog and put them in the
> proper place.
> 
> NOTE: in theory, I think it should actually be possible to have a pin
> controlled _both_ by the hog and by an actual device.  Once the device
> claims the pin I think the hog is supposed to let go.  I'm not 100%
> sure that this works and in any case this solution would be more
> complex than is necessary.
> 
> Reported-by: Marc Zyngier <marc.zyngier@arm.com>
> Fixes: 48f4d9796d99 ("arm64: dts: rockchip: add Gru/Kevin DTS")
> Fixes: 981ed1bfbc6c ("pinctrl: Really force states during suspend/resume")
> Signed-off-by: Douglas Anderson <dianders@chromium.org>

applied as fix for 4.16 with the 2 Tested-tags


Thanks
Heiko
Marc Zyngier March 6, 2018, 11:58 a.m. UTC | #5
Hi all,

On 01/03/18 08:43, Heiko Stübner wrote:
> Am Dienstag, 27. Februar 2018, 21:47:11 CET schrieb Douglas Anderson:
>> Back in the early days when gru devices were still under development
>> we found an issue where the WiFi reset line needed to be configured as
>> early as possible during the boot process to avoid the WiFi module
>> being in a bad state.
>>
>> We found that the way to get the kernel to do this in the earliest
>> possible place was to configure this line in the pinctrl hogs, so
>> that's what we did.  For some history here you can see
>> <http://crosreview.com/368770>.  After the time that change landed in
>> the kernel, we landed a firmware change to configure this line even
>> earlier.  See <http://crosreview.com/399919>.  However, even after the
>> firmware change landed we kept the kernel change to deal with the fact
>> that some people working on devices might take a little while to
>> update their firmware.
>>
>> At this there are definitely zero devices out in the wild that have
>> firmware without the fix in it.  Specifically looking in the firmware
>> branch several critically important fixes for memory stability landed
>> after the patch in coreboot and I know we didn't ship without those.
>> Thus, by now, everyone should have the new firmware and it's safe to
>> not have the kernel set this up in a pinctrl hog.
>>
>> Historically, even though it wasn't needed to have this in a pinctrl
>> hog, we still kept it since it didn't hurt.  Pinctrl would apply the
>> default hog at bootup and then would never touch things again.  That
>> all changed with commit 981ed1bfbc6c ("pinctrl: Really force states
>> during suspend/resume").  After that commit then we'll re-apply the
>> default hog at resume time and that can screw up the reset state of
>> WiFi.  ...and on rk3399 if you touch a device on PCIe in the wrong way
>> then the whole system can go haywire.  That's what was happening.
>> Specifically you'd resume a rk3399-gru-* device and it would mostly
>> resume, then would crash with some crazy weird crash.
>>
>> One could say, perhaps, that the recent pinctrl change was at fault
>> (and should be fixed) since it changed behavior.  ...but that's not
>> really true.  The device tree for rk3399-gru is really to blame.
>> Specifically since the pinctrl is defined in the hog and not in the
>> "wlan-pd-n" node then the actual user of this pin doesn't have a
>> pinctrl entry for it.  That's bad.
>>
>> Let's fix our problems by just moving the control of
>> "wlan_module_reset_l pinctrl" out of the hog and put them in the
>> proper place.
>>
>> NOTE: in theory, I think it should actually be possible to have a pin
>> controlled _both_ by the hog and by an actual device.  Once the device
>> claims the pin I think the hog is supposed to let go.  I'm not 100%
>> sure that this works and in any case this solution would be more
>> complex than is necessary.
>>
>> Reported-by: Marc Zyngier <marc.zyngier@arm.com>
>> Fixes: 48f4d9796d99 ("arm64: dts: rockchip: add Gru/Kevin DTS")
>> Fixes: 981ed1bfbc6c ("pinctrl: Really force states during suspend/resume")
>> Signed-off-by: Douglas Anderson <dianders@chromium.org>
> 
> applied as fix for 4.16 with the 2 Tested-tags
Sorry to rain on everyone's parade, but further testing shows that this
patch may not be enough to restore a reliable s2r. My initial testing
did show that we were resuming without the VOP errors, but there seem to
be further issues (I'm loosing the keyboard and the trackpad after
resume on Kevin).

Applying my initial hack makes it work again. I suspect that there are
more hog pins that need tweaking, but I'm a bit out of my depth here.

Thanks,

	M.
Doug Anderson March 6, 2018, 6 p.m. UTC | #6
Hi,

On Tue, Mar 6, 2018 at 3:58 AM, Marc Zyngier <marc.zyngier@arm.com> wrote:
> Hi all,
>
> On 01/03/18 08:43, Heiko Stübner wrote:
>> Am Dienstag, 27. Februar 2018, 21:47:11 CET schrieb Douglas Anderson:
>>> Back in the early days when gru devices were still under development
>>> we found an issue where the WiFi reset line needed to be configured as
>>> early as possible during the boot process to avoid the WiFi module
>>> being in a bad state.
>>>
>>> We found that the way to get the kernel to do this in the earliest
>>> possible place was to configure this line in the pinctrl hogs, so
>>> that's what we did.  For some history here you can see
>>> <http://crosreview.com/368770>.  After the time that change landed in
>>> the kernel, we landed a firmware change to configure this line even
>>> earlier.  See <http://crosreview.com/399919>.  However, even after the
>>> firmware change landed we kept the kernel change to deal with the fact
>>> that some people working on devices might take a little while to
>>> update their firmware.
>>>
>>> At this there are definitely zero devices out in the wild that have
>>> firmware without the fix in it.  Specifically looking in the firmware
>>> branch several critically important fixes for memory stability landed
>>> after the patch in coreboot and I know we didn't ship without those.
>>> Thus, by now, everyone should have the new firmware and it's safe to
>>> not have the kernel set this up in a pinctrl hog.
>>>
>>> Historically, even though it wasn't needed to have this in a pinctrl
>>> hog, we still kept it since it didn't hurt.  Pinctrl would apply the
>>> default hog at bootup and then would never touch things again.  That
>>> all changed with commit 981ed1bfbc6c ("pinctrl: Really force states
>>> during suspend/resume").  After that commit then we'll re-apply the
>>> default hog at resume time and that can screw up the reset state of
>>> WiFi.  ...and on rk3399 if you touch a device on PCIe in the wrong way
>>> then the whole system can go haywire.  That's what was happening.
>>> Specifically you'd resume a rk3399-gru-* device and it would mostly
>>> resume, then would crash with some crazy weird crash.
>>>
>>> One could say, perhaps, that the recent pinctrl change was at fault
>>> (and should be fixed) since it changed behavior.  ...but that's not
>>> really true.  The device tree for rk3399-gru is really to blame.
>>> Specifically since the pinctrl is defined in the hog and not in the
>>> "wlan-pd-n" node then the actual user of this pin doesn't have a
>>> pinctrl entry for it.  That's bad.
>>>
>>> Let's fix our problems by just moving the control of
>>> "wlan_module_reset_l pinctrl" out of the hog and put them in the
>>> proper place.
>>>
>>> NOTE: in theory, I think it should actually be possible to have a pin
>>> controlled _both_ by the hog and by an actual device.  Once the device
>>> claims the pin I think the hog is supposed to let go.  I'm not 100%
>>> sure that this works and in any case this solution would be more
>>> complex than is necessary.
>>>
>>> Reported-by: Marc Zyngier <marc.zyngier@arm.com>
>>> Fixes: 48f4d9796d99 ("arm64: dts: rockchip: add Gru/Kevin DTS")
>>> Fixes: 981ed1bfbc6c ("pinctrl: Really force states during suspend/resume")
>>> Signed-off-by: Douglas Anderson <dianders@chromium.org>
>>
>> applied as fix for 4.16 with the 2 Tested-tags
> Sorry to rain on everyone's parade, but further testing shows that this
> patch may not be enough to restore a reliable s2r. My initial testing
> did show that we were resuming without the VOP errors, but there seem to
> be further issues (I'm loosing the keyboard and the trackpad after
> resume on Kevin).
>
> Applying my initial hack makes it work again. I suspect that there are
> more hog pins that need tweaking, but I'm a bit out of my depth here.

Are you positive you weren't just wearing your lucky hat when you
tested your patch and then took it off when you tested mine?  As far
as I can see the only hogs left on kevin are:

  &ap_pwroff      /* AP will auto-assert this when in S3 */
  &clk_32k        /* This pin is always 32k on gru boards */

Those map to:

  ap_pwroff: ap-pwroff {
     rockchip,pins = <1 5 RK_FUNC_1 &pcfg_pull_none>;
  };

  clk_32k: clk-32k {
    rockchip,pins = <0 0 RK_FUNC_2 &pcfg_pull_none>;
  };

So I added some printouts at suspend/resume time.  Specifically I set
a boolean to "true" for the duration rockchip_pinctrl_suspend() and
rockchip_pinctrl_resume() and this turned on a printout in
rockchip_set_mux().  My printout looked like this (yeah, I know it's a
whitespace-damaged patch just to show what I'm doing):

+       regmap_read(regmap, reg, &before);
        data = (mask << (bit + 16));
        rmask = data | (data >> 16);
        data |= (mux & mask) << bit;
        ret = regmap_update_bits(regmap, reg, rmask, data);

+       regmap_read(regmap, reg, &after);
+
+       if (DOUG) {
+               dev_info(info->dev,
+                        "setting mux of GPIO%d-%d to %d; %#010x=>%#010x\n",
+                        bank->bank_num, pin, mux, reg, before, after);
+       }

...and a similar one in rockchip_set_pull().  That showed this at resume time:

[   62.284427] rockchip-pinctrl pinctrl: setting mux of GPIO1-5 to 1;
0x00009400=>0x00009400
[   62.294423] rockchip-pinctrl pinctrl: setting pull of GPIO1-5;
0x000041aa=>0x000041aa
[   62.303343] rockchip-pinctrl pinctrl: setting mux of GPIO0-0 to 2;
0x00005002=>0x00005002
[   62.313240] rockchip-pinctrl pinctrl: setting pull of GPIO0-0;
0x00000ddc=>0x00000ddc
[

Said another way: pinmux and pull isn't actually changing due to the
hogs.  We can see if something else could be changing, but I'd really
want to be sure you're certain that the hogs are causing you
problems...


-Doug
Marc Zyngier March 6, 2018, 6:15 p.m. UTC | #7
Hi Doug,

On 06/03/18 18:00, Doug Anderson wrote:
> Hi,
> 
> On Tue, Mar 6, 2018 at 3:58 AM, Marc Zyngier <marc.zyngier@arm.com> wrote:
>> Hi all,
>>
>> On 01/03/18 08:43, Heiko Stübner wrote:
>>> Am Dienstag, 27. Februar 2018, 21:47:11 CET schrieb Douglas Anderson:
>>>> Back in the early days when gru devices were still under development
>>>> we found an issue where the WiFi reset line needed to be configured as
>>>> early as possible during the boot process to avoid the WiFi module
>>>> being in a bad state.
>>>>
>>>> We found that the way to get the kernel to do this in the earliest
>>>> possible place was to configure this line in the pinctrl hogs, so
>>>> that's what we did.  For some history here you can see
>>>> <http://crosreview.com/368770>.  After the time that change landed in
>>>> the kernel, we landed a firmware change to configure this line even
>>>> earlier.  See <http://crosreview.com/399919>.  However, even after the
>>>> firmware change landed we kept the kernel change to deal with the fact
>>>> that some people working on devices might take a little while to
>>>> update their firmware.
>>>>
>>>> At this there are definitely zero devices out in the wild that have
>>>> firmware without the fix in it.  Specifically looking in the firmware
>>>> branch several critically important fixes for memory stability landed
>>>> after the patch in coreboot and I know we didn't ship without those.
>>>> Thus, by now, everyone should have the new firmware and it's safe to
>>>> not have the kernel set this up in a pinctrl hog.
>>>>
>>>> Historically, even though it wasn't needed to have this in a pinctrl
>>>> hog, we still kept it since it didn't hurt.  Pinctrl would apply the
>>>> default hog at bootup and then would never touch things again.  That
>>>> all changed with commit 981ed1bfbc6c ("pinctrl: Really force states
>>>> during suspend/resume").  After that commit then we'll re-apply the
>>>> default hog at resume time and that can screw up the reset state of
>>>> WiFi.  ...and on rk3399 if you touch a device on PCIe in the wrong way
>>>> then the whole system can go haywire.  That's what was happening.
>>>> Specifically you'd resume a rk3399-gru-* device and it would mostly
>>>> resume, then would crash with some crazy weird crash.
>>>>
>>>> One could say, perhaps, that the recent pinctrl change was at fault
>>>> (and should be fixed) since it changed behavior.  ...but that's not
>>>> really true.  The device tree for rk3399-gru is really to blame.
>>>> Specifically since the pinctrl is defined in the hog and not in the
>>>> "wlan-pd-n" node then the actual user of this pin doesn't have a
>>>> pinctrl entry for it.  That's bad.
>>>>
>>>> Let's fix our problems by just moving the control of
>>>> "wlan_module_reset_l pinctrl" out of the hog and put them in the
>>>> proper place.
>>>>
>>>> NOTE: in theory, I think it should actually be possible to have a pin
>>>> controlled _both_ by the hog and by an actual device.  Once the device
>>>> claims the pin I think the hog is supposed to let go.  I'm not 100%
>>>> sure that this works and in any case this solution would be more
>>>> complex than is necessary.
>>>>
>>>> Reported-by: Marc Zyngier <marc.zyngier@arm.com>
>>>> Fixes: 48f4d9796d99 ("arm64: dts: rockchip: add Gru/Kevin DTS")
>>>> Fixes: 981ed1bfbc6c ("pinctrl: Really force states during suspend/resume")
>>>> Signed-off-by: Douglas Anderson <dianders@chromium.org>
>>>
>>> applied as fix for 4.16 with the 2 Tested-tags
>> Sorry to rain on everyone's parade, but further testing shows that this
>> patch may not be enough to restore a reliable s2r. My initial testing
>> did show that we were resuming without the VOP errors, but there seem to
>> be further issues (I'm loosing the keyboard and the trackpad after
>> resume on Kevin).
>>
>> Applying my initial hack makes it work again. I suspect that there are
>> more hog pins that need tweaking, but I'm a bit out of my depth here.
> 
> Are you positive you weren't just wearing your lucky hat when you
> tested your patch and then took it off when you tested mine?  As far

So far, I seem to have a 100% success rate in resuming with my silly
hack, whist your DT patch alone only gives me a 50% rate at best.

> as I can see the only hogs left on kevin are:
> 
>   &ap_pwroff      /* AP will auto-assert this when in S3 */
>   &clk_32k        /* This pin is always 32k on gru boards */
> 
> Those map to:
> 
>   ap_pwroff: ap-pwroff {
>      rockchip,pins = <1 5 RK_FUNC_1 &pcfg_pull_none>;
>   };
> 
>   clk_32k: clk-32k {
>     rockchip,pins = <0 0 RK_FUNC_2 &pcfg_pull_none>;
>   };
> 
> So I added some printouts at suspend/resume time.  Specifically I set
> a boolean to "true" for the duration rockchip_pinctrl_suspend() and
> rockchip_pinctrl_resume() and this turned on a printout in
> rockchip_set_mux().  My printout looked like this (yeah, I know it's a
> whitespace-damaged patch just to show what I'm doing):
> 
> +       regmap_read(regmap, reg, &before);
>         data = (mask << (bit + 16));
>         rmask = data | (data >> 16);
>         data |= (mux & mask) << bit;
>         ret = regmap_update_bits(regmap, reg, rmask, data);
> 
> +       regmap_read(regmap, reg, &after);
> +
> +       if (DOUG) {
> +               dev_info(info->dev,
> +                        "setting mux of GPIO%d-%d to %d; %#010x=>%#010x\n",
> +                        bank->bank_num, pin, mux, reg, before, after);
> +       }
> 
> ...and a similar one in rockchip_set_pull().  That showed this at resume time:
> 
> [   62.284427] rockchip-pinctrl pinctrl: setting mux of GPIO1-5 to 1;
> 0x00009400=>0x00009400
> [   62.294423] rockchip-pinctrl pinctrl: setting pull of GPIO1-5;
> 0x000041aa=>0x000041aa
> [   62.303343] rockchip-pinctrl pinctrl: setting mux of GPIO0-0 to 2;
> 0x00005002=>0x00005002
> [   62.313240] rockchip-pinctrl pinctrl: setting pull of GPIO0-0;
> 0x00000ddc=>0x00000ddc
> [
> 
> Said another way: pinmux and pull isn't actually changing due to the
> hogs.  We can see if something else could be changing, but I'd really
> want to be sure you're certain that the hogs are causing you
> problems...

I cannot say for sure that the hogs are the issue. But I thought that
they were the only pins affected by 981ed1bfbc6c... If this patch can
affect other pins, then I'm probably barking up the wrong tree.

Thanks,

	M.
Heiko Stübner March 6, 2018, 6:49 p.m. UTC | #8
Am Dienstag, 6. März 2018, 19:15:18 CET schrieb Marc Zyngier:
> Hi Doug,
> 
> On 06/03/18 18:00, Doug Anderson wrote:
> > Hi,
> > 
> > On Tue, Mar 6, 2018 at 3:58 AM, Marc Zyngier <marc.zyngier@arm.com> wrote:
> >> Hi all,
> >> 
> >> On 01/03/18 08:43, Heiko Stübner wrote:
> >>> Am Dienstag, 27. Februar 2018, 21:47:11 CET schrieb Douglas Anderson:
> >>>> Back in the early days when gru devices were still under development
> >>>> we found an issue where the WiFi reset line needed to be configured as
> >>>> early as possible during the boot process to avoid the WiFi module
> >>>> being in a bad state.
> >>>> 
> >>>> We found that the way to get the kernel to do this in the earliest
> >>>> possible place was to configure this line in the pinctrl hogs, so
> >>>> that's what we did.  For some history here you can see
> >>>> <http://crosreview.com/368770>.  After the time that change landed in
> >>>> the kernel, we landed a firmware change to configure this line even
> >>>> earlier.  See <http://crosreview.com/399919>.  However, even after the
> >>>> firmware change landed we kept the kernel change to deal with the fact
> >>>> that some people working on devices might take a little while to
> >>>> update their firmware.
> >>>> 
> >>>> At this there are definitely zero devices out in the wild that have
> >>>> firmware without the fix in it.  Specifically looking in the firmware
> >>>> branch several critically important fixes for memory stability landed
> >>>> after the patch in coreboot and I know we didn't ship without those.
> >>>> Thus, by now, everyone should have the new firmware and it's safe to
> >>>> not have the kernel set this up in a pinctrl hog.
> >>>> 
> >>>> Historically, even though it wasn't needed to have this in a pinctrl
> >>>> hog, we still kept it since it didn't hurt.  Pinctrl would apply the
> >>>> default hog at bootup and then would never touch things again.  That
> >>>> all changed with commit 981ed1bfbc6c ("pinctrl: Really force states
> >>>> during suspend/resume").  After that commit then we'll re-apply the
> >>>> default hog at resume time and that can screw up the reset state of
> >>>> WiFi.  ...and on rk3399 if you touch a device on PCIe in the wrong way
> >>>> then the whole system can go haywire.  That's what was happening.
> >>>> Specifically you'd resume a rk3399-gru-* device and it would mostly
> >>>> resume, then would crash with some crazy weird crash.
> >>>> 
> >>>> One could say, perhaps, that the recent pinctrl change was at fault
> >>>> (and should be fixed) since it changed behavior.  ...but that's not
> >>>> really true.  The device tree for rk3399-gru is really to blame.
> >>>> Specifically since the pinctrl is defined in the hog and not in the
> >>>> "wlan-pd-n" node then the actual user of this pin doesn't have a
> >>>> pinctrl entry for it.  That's bad.
> >>>> 
> >>>> Let's fix our problems by just moving the control of
> >>>> "wlan_module_reset_l pinctrl" out of the hog and put them in the
> >>>> proper place.
> >>>> 
> >>>> NOTE: in theory, I think it should actually be possible to have a pin
> >>>> controlled _both_ by the hog and by an actual device.  Once the device
> >>>> claims the pin I think the hog is supposed to let go.  I'm not 100%
> >>>> sure that this works and in any case this solution would be more
> >>>> complex than is necessary.
> >>>> 
> >>>> Reported-by: Marc Zyngier <marc.zyngier@arm.com>
> >>>> Fixes: 48f4d9796d99 ("arm64: dts: rockchip: add Gru/Kevin DTS")
> >>>> Fixes: 981ed1bfbc6c ("pinctrl: Really force states during
> >>>> suspend/resume")
> >>>> Signed-off-by: Douglas Anderson <dianders@chromium.org>
> >>> 
> >>> applied as fix for 4.16 with the 2 Tested-tags
> >> 
> >> Sorry to rain on everyone's parade, but further testing shows that this
> >> patch may not be enough to restore a reliable s2r. My initial testing
> >> did show that we were resuming without the VOP errors, but there seem to
> >> be further issues (I'm loosing the keyboard and the trackpad after
> >> resume on Kevin).
> >> 
> >> Applying my initial hack makes it work again. I suspect that there are
> >> more hog pins that need tweaking, but I'm a bit out of my depth here.
> > 
> > Are you positive you weren't just wearing your lucky hat when you
> > tested your patch and then took it off when you tested mine?  As far
> 
> So far, I seem to have a 100% success rate in resuming with my silly
> hack, whist your DT patch alone only gives me a 50% rate at best.
> 
> > as I can see the only hogs left on kevin are:
> >   &ap_pwroff      /* AP will auto-assert this when in S3 */
> >   &clk_32k        /* This pin is always 32k on gru boards */
> > 
> > Those map to:
> >   ap_pwroff: ap-pwroff {
> >   
> >      rockchip,pins = <1 5 RK_FUNC_1 &pcfg_pull_none>;
> >   
> >   };
> >   
> >   clk_32k: clk-32k {
> >   
> >     rockchip,pins = <0 0 RK_FUNC_2 &pcfg_pull_none>;
> >   
> >   };
> > 
> > So I added some printouts at suspend/resume time.  Specifically I set
> > a boolean to "true" for the duration rockchip_pinctrl_suspend() and
> > rockchip_pinctrl_resume() and this turned on a printout in
> > rockchip_set_mux().  My printout looked like this (yeah, I know it's a
> > whitespace-damaged patch just to show what I'm doing):
> > 
> > +       regmap_read(regmap, reg, &before);
> > 
> >         data = (mask << (bit + 16));
> >         rmask = data | (data >> 16);
> >         data |= (mux & mask) << bit;
> >         ret = regmap_update_bits(regmap, reg, rmask, data);
> > 
> > +       regmap_read(regmap, reg, &after);
> > +
> > +       if (DOUG) {
> > +               dev_info(info->dev,
> > +                        "setting mux of GPIO%d-%d to %d;
> > %#010x=>%#010x\n", +                        bank->bank_num, pin, mux,
> > reg, before, after); +       }
> > 
> > ...and a similar one in rockchip_set_pull().  That showed this at resume
> > time:
> > 
> > [   62.284427] rockchip-pinctrl pinctrl: setting mux of GPIO1-5 to 1;
> > 0x00009400=>0x00009400
> > [   62.294423] rockchip-pinctrl pinctrl: setting pull of GPIO1-5;
> > 0x000041aa=>0x000041aa
> > [   62.303343] rockchip-pinctrl pinctrl: setting mux of GPIO0-0 to 2;
> > 0x00005002=>0x00005002
> > [   62.313240] rockchip-pinctrl pinctrl: setting pull of GPIO0-0;
> > 0x00000ddc=>0x00000ddc
> > [
> > 
> > Said another way: pinmux and pull isn't actually changing due to the
> > hogs.  We can see if something else could be changing, but I'd really
> > want to be sure you're certain that the hogs are causing you
> > problems...
> 
> I cannot say for sure that the hogs are the issue. But I thought that
> they were the only pins affected by 981ed1bfbc6c... If this patch can
> affect other pins, then I'm probably barking up the wrong tree.

On Kevin I see something like

[   60.764129] cros-ec-spi spi2.0: spi transfer failed: -108
[   60.764132] cros-ec-spi spi2.0: cs-deassert spi transfer failed: -108
[   60.764136] cros-ec-spi spi2.0: Command xfer error (err:-108)
[   60.764365] cros-ec-spi spi2.0: spi transfer failed: -108
[   60.764368] cros-ec-spi spi2.0: cs-deassert spi transfer failed: -108
[   60.764371] cros-ec-spi spi2.0: Command xfer error (err:-108)

on resume with my current for-next. So maybe your hack just
happened to change some timing during resume?

Suspend/Resume also disconnects my usb-ethernet, making me lose my
nfsroot, so I can test this once every boot only.
Marc Zyngier March 6, 2018, 6:58 p.m. UTC | #9
On 06/03/18 18:49, Heiko Stübner wrote:
> Am Dienstag, 6. März 2018, 19:15:18 CET schrieb Marc Zyngier:
>> Hi Doug,
>>
>> On 06/03/18 18:00, Doug Anderson wrote:
>>> Hi,
>>>
>>> On Tue, Mar 6, 2018 at 3:58 AM, Marc Zyngier <marc.zyngier@arm.com> wrote:
>>>> Hi all,
>>>>
>>>> On 01/03/18 08:43, Heiko Stübner wrote:
>>>>> Am Dienstag, 27. Februar 2018, 21:47:11 CET schrieb Douglas Anderson:
>>>>>> Back in the early days when gru devices were still under development
>>>>>> we found an issue where the WiFi reset line needed to be configured as
>>>>>> early as possible during the boot process to avoid the WiFi module
>>>>>> being in a bad state.
>>>>>>
>>>>>> We found that the way to get the kernel to do this in the earliest
>>>>>> possible place was to configure this line in the pinctrl hogs, so
>>>>>> that's what we did.  For some history here you can see
>>>>>> <http://crosreview.com/368770>.  After the time that change landed in
>>>>>> the kernel, we landed a firmware change to configure this line even
>>>>>> earlier.  See <http://crosreview.com/399919>.  However, even after the
>>>>>> firmware change landed we kept the kernel change to deal with the fact
>>>>>> that some people working on devices might take a little while to
>>>>>> update their firmware.
>>>>>>
>>>>>> At this there are definitely zero devices out in the wild that have
>>>>>> firmware without the fix in it.  Specifically looking in the firmware
>>>>>> branch several critically important fixes for memory stability landed
>>>>>> after the patch in coreboot and I know we didn't ship without those.
>>>>>> Thus, by now, everyone should have the new firmware and it's safe to
>>>>>> not have the kernel set this up in a pinctrl hog.
>>>>>>
>>>>>> Historically, even though it wasn't needed to have this in a pinctrl
>>>>>> hog, we still kept it since it didn't hurt.  Pinctrl would apply the
>>>>>> default hog at bootup and then would never touch things again.  That
>>>>>> all changed with commit 981ed1bfbc6c ("pinctrl: Really force states
>>>>>> during suspend/resume").  After that commit then we'll re-apply the
>>>>>> default hog at resume time and that can screw up the reset state of
>>>>>> WiFi.  ...and on rk3399 if you touch a device on PCIe in the wrong way
>>>>>> then the whole system can go haywire.  That's what was happening.
>>>>>> Specifically you'd resume a rk3399-gru-* device and it would mostly
>>>>>> resume, then would crash with some crazy weird crash.
>>>>>>
>>>>>> One could say, perhaps, that the recent pinctrl change was at fault
>>>>>> (and should be fixed) since it changed behavior.  ...but that's not
>>>>>> really true.  The device tree for rk3399-gru is really to blame.
>>>>>> Specifically since the pinctrl is defined in the hog and not in the
>>>>>> "wlan-pd-n" node then the actual user of this pin doesn't have a
>>>>>> pinctrl entry for it.  That's bad.
>>>>>>
>>>>>> Let's fix our problems by just moving the control of
>>>>>> "wlan_module_reset_l pinctrl" out of the hog and put them in the
>>>>>> proper place.
>>>>>>
>>>>>> NOTE: in theory, I think it should actually be possible to have a pin
>>>>>> controlled _both_ by the hog and by an actual device.  Once the device
>>>>>> claims the pin I think the hog is supposed to let go.  I'm not 100%
>>>>>> sure that this works and in any case this solution would be more
>>>>>> complex than is necessary.
>>>>>>
>>>>>> Reported-by: Marc Zyngier <marc.zyngier@arm.com>
>>>>>> Fixes: 48f4d9796d99 ("arm64: dts: rockchip: add Gru/Kevin DTS")
>>>>>> Fixes: 981ed1bfbc6c ("pinctrl: Really force states during
>>>>>> suspend/resume")
>>>>>> Signed-off-by: Douglas Anderson <dianders@chromium.org>
>>>>>
>>>>> applied as fix for 4.16 with the 2 Tested-tags
>>>>
>>>> Sorry to rain on everyone's parade, but further testing shows that this
>>>> patch may not be enough to restore a reliable s2r. My initial testing
>>>> did show that we were resuming without the VOP errors, but there seem to
>>>> be further issues (I'm loosing the keyboard and the trackpad after
>>>> resume on Kevin).
>>>>
>>>> Applying my initial hack makes it work again. I suspect that there are
>>>> more hog pins that need tweaking, but I'm a bit out of my depth here.
>>>
>>> Are you positive you weren't just wearing your lucky hat when you
>>> tested your patch and then took it off when you tested mine?  As far
>>
>> So far, I seem to have a 100% success rate in resuming with my silly
>> hack, whist your DT patch alone only gives me a 50% rate at best.
>>
>>> as I can see the only hogs left on kevin are:
>>>   &ap_pwroff      /* AP will auto-assert this when in S3 */
>>>   &clk_32k        /* This pin is always 32k on gru boards */
>>>
>>> Those map to:
>>>   ap_pwroff: ap-pwroff {
>>>   
>>>      rockchip,pins = <1 5 RK_FUNC_1 &pcfg_pull_none>;
>>>   
>>>   };
>>>   
>>>   clk_32k: clk-32k {
>>>   
>>>     rockchip,pins = <0 0 RK_FUNC_2 &pcfg_pull_none>;
>>>   
>>>   };
>>>
>>> So I added some printouts at suspend/resume time.  Specifically I set
>>> a boolean to "true" for the duration rockchip_pinctrl_suspend() and
>>> rockchip_pinctrl_resume() and this turned on a printout in
>>> rockchip_set_mux().  My printout looked like this (yeah, I know it's a
>>> whitespace-damaged patch just to show what I'm doing):
>>>
>>> +       regmap_read(regmap, reg, &before);
>>>
>>>         data = (mask << (bit + 16));
>>>         rmask = data | (data >> 16);
>>>         data |= (mux & mask) << bit;
>>>         ret = regmap_update_bits(regmap, reg, rmask, data);
>>>
>>> +       regmap_read(regmap, reg, &after);
>>> +
>>> +       if (DOUG) {
>>> +               dev_info(info->dev,
>>> +                        "setting mux of GPIO%d-%d to %d;
>>> %#010x=>%#010x\n", +                        bank->bank_num, pin, mux,
>>> reg, before, after); +       }
>>>
>>> ...and a similar one in rockchip_set_pull().  That showed this at resume
>>> time:
>>>
>>> [   62.284427] rockchip-pinctrl pinctrl: setting mux of GPIO1-5 to 1;
>>> 0x00009400=>0x00009400
>>> [   62.294423] rockchip-pinctrl pinctrl: setting pull of GPIO1-5;
>>> 0x000041aa=>0x000041aa
>>> [   62.303343] rockchip-pinctrl pinctrl: setting mux of GPIO0-0 to 2;
>>> 0x00005002=>0x00005002
>>> [   62.313240] rockchip-pinctrl pinctrl: setting pull of GPIO0-0;
>>> 0x00000ddc=>0x00000ddc
>>> [
>>>
>>> Said another way: pinmux and pull isn't actually changing due to the
>>> hogs.  We can see if something else could be changing, but I'd really
>>> want to be sure you're certain that the hogs are causing you
>>> problems...
>>
>> I cannot say for sure that the hogs are the issue. But I thought that
>> they were the only pins affected by 981ed1bfbc6c... If this patch can
>> affect other pins, then I'm probably barking up the wrong tree.
> 
> On Kevin I see something like
> 
> [   60.764129] cros-ec-spi spi2.0: spi transfer failed: -108
> [   60.764132] cros-ec-spi spi2.0: cs-deassert spi transfer failed: -108
> [   60.764136] cros-ec-spi spi2.0: Command xfer error (err:-108)
> [   60.764365] cros-ec-spi spi2.0: spi transfer failed: -108
> [   60.764368] cros-ec-spi spi2.0: cs-deassert spi transfer failed: -108
> [   60.764371] cros-ec-spi spi2.0: Command xfer error (err:-108)
> 
> on resume with my current for-next. So maybe your hack just
> happened to change some timing during resume?

No, I carry yet another patch to make that one work[1].

> Suspend/Resume also disconnects my usb-ethernet, making me lose my
> nfsroot, so I can test this once every boot only.

I use my kevin as a "real" laptop, which means it gets suspended/resumed 
at least 20 times a day, no reboots involved (unless I'm actually testing
arm64 code on it).

	M.

[1] https://git.kernel.org/pub/scm/linux/kernel/git/maz/arm-platforms.git/commit/?h=hack/kevin-4.16&id=4e95dc697ec6b66579f8747c917dfe675bb771b2
Doug Anderson March 7, 2018, 4:54 a.m. UTC | #10
Hi,

On Tue, Mar 6, 2018 at 10:58 AM, Marc Zyngier <marc.zyngier@arm.com> wrote:
> On 06/03/18 18:49, Heiko Stübner wrote:
>> Am Dienstag, 6. März 2018, 19:15:18 CET schrieb Marc Zyngier:
>>> Hi Doug,
>>>
>>> On 06/03/18 18:00, Doug Anderson wrote:
>>>> Hi,
>>>>
>>>> On Tue, Mar 6, 2018 at 3:58 AM, Marc Zyngier <marc.zyngier@arm.com> wrote:
>>>>> Hi all,
>>>>>
>>>>> On 01/03/18 08:43, Heiko Stübner wrote:
>>>>>> Am Dienstag, 27. Februar 2018, 21:47:11 CET schrieb Douglas Anderson:
>>>>>>> Back in the early days when gru devices were still under development
>>>>>>> we found an issue where the WiFi reset line needed to be configured as
>>>>>>> early as possible during the boot process to avoid the WiFi module
>>>>>>> being in a bad state.
>>>>>>>
>>>>>>> We found that the way to get the kernel to do this in the earliest
>>>>>>> possible place was to configure this line in the pinctrl hogs, so
>>>>>>> that's what we did.  For some history here you can see
>>>>>>> <http://crosreview.com/368770>.  After the time that change landed in
>>>>>>> the kernel, we landed a firmware change to configure this line even
>>>>>>> earlier.  See <http://crosreview.com/399919>.  However, even after the
>>>>>>> firmware change landed we kept the kernel change to deal with the fact
>>>>>>> that some people working on devices might take a little while to
>>>>>>> update their firmware.
>>>>>>>
>>>>>>> At this there are definitely zero devices out in the wild that have
>>>>>>> firmware without the fix in it.  Specifically looking in the firmware
>>>>>>> branch several critically important fixes for memory stability landed
>>>>>>> after the patch in coreboot and I know we didn't ship without those.
>>>>>>> Thus, by now, everyone should have the new firmware and it's safe to
>>>>>>> not have the kernel set this up in a pinctrl hog.
>>>>>>>
>>>>>>> Historically, even though it wasn't needed to have this in a pinctrl
>>>>>>> hog, we still kept it since it didn't hurt.  Pinctrl would apply the
>>>>>>> default hog at bootup and then would never touch things again.  That
>>>>>>> all changed with commit 981ed1bfbc6c ("pinctrl: Really force states
>>>>>>> during suspend/resume").  After that commit then we'll re-apply the
>>>>>>> default hog at resume time and that can screw up the reset state of
>>>>>>> WiFi.  ...and on rk3399 if you touch a device on PCIe in the wrong way
>>>>>>> then the whole system can go haywire.  That's what was happening.
>>>>>>> Specifically you'd resume a rk3399-gru-* device and it would mostly
>>>>>>> resume, then would crash with some crazy weird crash.
>>>>>>>
>>>>>>> One could say, perhaps, that the recent pinctrl change was at fault
>>>>>>> (and should be fixed) since it changed behavior.  ...but that's not
>>>>>>> really true.  The device tree for rk3399-gru is really to blame.
>>>>>>> Specifically since the pinctrl is defined in the hog and not in the
>>>>>>> "wlan-pd-n" node then the actual user of this pin doesn't have a
>>>>>>> pinctrl entry for it.  That's bad.
>>>>>>>
>>>>>>> Let's fix our problems by just moving the control of
>>>>>>> "wlan_module_reset_l pinctrl" out of the hog and put them in the
>>>>>>> proper place.
>>>>>>>
>>>>>>> NOTE: in theory, I think it should actually be possible to have a pin
>>>>>>> controlled _both_ by the hog and by an actual device.  Once the device
>>>>>>> claims the pin I think the hog is supposed to let go.  I'm not 100%
>>>>>>> sure that this works and in any case this solution would be more
>>>>>>> complex than is necessary.
>>>>>>>
>>>>>>> Reported-by: Marc Zyngier <marc.zyngier@arm.com>
>>>>>>> Fixes: 48f4d9796d99 ("arm64: dts: rockchip: add Gru/Kevin DTS")
>>>>>>> Fixes: 981ed1bfbc6c ("pinctrl: Really force states during
>>>>>>> suspend/resume")
>>>>>>> Signed-off-by: Douglas Anderson <dianders@chromium.org>
>>>>>>
>>>>>> applied as fix for 4.16 with the 2 Tested-tags
>>>>>
>>>>> Sorry to rain on everyone's parade, but further testing shows that this
>>>>> patch may not be enough to restore a reliable s2r. My initial testing
>>>>> did show that we were resuming without the VOP errors, but there seem to
>>>>> be further issues (I'm loosing the keyboard and the trackpad after
>>>>> resume on Kevin).
>>>>>
>>>>> Applying my initial hack makes it work again. I suspect that there are
>>>>> more hog pins that need tweaking, but I'm a bit out of my depth here.
>>>>
>>>> Are you positive you weren't just wearing your lucky hat when you
>>>> tested your patch and then took it off when you tested mine?  As far
>>>
>>> So far, I seem to have a 100% success rate in resuming with my silly
>>> hack, whist your DT patch alone only gives me a 50% rate at best.
>>>
>>>> as I can see the only hogs left on kevin are:
>>>>   &ap_pwroff      /* AP will auto-assert this when in S3 */
>>>>   &clk_32k        /* This pin is always 32k on gru boards */
>>>>
>>>> Those map to:
>>>>   ap_pwroff: ap-pwroff {
>>>>
>>>>      rockchip,pins = <1 5 RK_FUNC_1 &pcfg_pull_none>;
>>>>
>>>>   };
>>>>
>>>>   clk_32k: clk-32k {
>>>>
>>>>     rockchip,pins = <0 0 RK_FUNC_2 &pcfg_pull_none>;
>>>>
>>>>   };
>>>>
>>>> So I added some printouts at suspend/resume time.  Specifically I set
>>>> a boolean to "true" for the duration rockchip_pinctrl_suspend() and
>>>> rockchip_pinctrl_resume() and this turned on a printout in
>>>> rockchip_set_mux().  My printout looked like this (yeah, I know it's a
>>>> whitespace-damaged patch just to show what I'm doing):
>>>>
>>>> +       regmap_read(regmap, reg, &before);
>>>>
>>>>         data = (mask << (bit + 16));
>>>>         rmask = data | (data >> 16);
>>>>         data |= (mux & mask) << bit;
>>>>         ret = regmap_update_bits(regmap, reg, rmask, data);
>>>>
>>>> +       regmap_read(regmap, reg, &after);
>>>> +
>>>> +       if (DOUG) {
>>>> +               dev_info(info->dev,
>>>> +                        "setting mux of GPIO%d-%d to %d;
>>>> %#010x=>%#010x\n", +                        bank->bank_num, pin, mux,
>>>> reg, before, after); +       }
>>>>
>>>> ...and a similar one in rockchip_set_pull().  That showed this at resume
>>>> time:
>>>>
>>>> [   62.284427] rockchip-pinctrl pinctrl: setting mux of GPIO1-5 to 1;
>>>> 0x00009400=>0x00009400
>>>> [   62.294423] rockchip-pinctrl pinctrl: setting pull of GPIO1-5;
>>>> 0x000041aa=>0x000041aa
>>>> [   62.303343] rockchip-pinctrl pinctrl: setting mux of GPIO0-0 to 2;
>>>> 0x00005002=>0x00005002
>>>> [   62.313240] rockchip-pinctrl pinctrl: setting pull of GPIO0-0;
>>>> 0x00000ddc=>0x00000ddc
>>>> [
>>>>
>>>> Said another way: pinmux and pull isn't actually changing due to the
>>>> hogs.  We can see if something else could be changing, but I'd really
>>>> want to be sure you're certain that the hogs are causing you
>>>> problems...
>>>
>>> I cannot say for sure that the hogs are the issue. But I thought that
>>> they were the only pins affected by 981ed1bfbc6c... If this patch can
>>> affect other pins, then I'm probably barking up the wrong tree.
>>
>> On Kevin I see something like
>>
>> [   60.764129] cros-ec-spi spi2.0: spi transfer failed: -108
>> [   60.764132] cros-ec-spi spi2.0: cs-deassert spi transfer failed: -108
>> [   60.764136] cros-ec-spi spi2.0: Command xfer error (err:-108)
>> [   60.764365] cros-ec-spi spi2.0: spi transfer failed: -108
>> [   60.764368] cros-ec-spi spi2.0: cs-deassert spi transfer failed: -108
>> [   60.764371] cros-ec-spi spi2.0: Command xfer error (err:-108)
>>
>> on resume with my current for-next. So maybe your hack just
>> happened to change some timing during resume?
>
> No, I carry yet another patch to make that one work[1].
>
>> Suspend/Resume also disconnects my usb-ethernet, making me lose my
>> nfsroot, so I can test this once every boot only.

Yeah, it kills usb-ethernet for me too.  ...so I can suspend/resume
once and then the next suspend fails with a bunch of usb errors.  This
is based on your "hack/kevin-4.16", which looks like:

e7934b797f4b (HEAD, linux_arm-platforms/hack/kevin-4.16) arm64: dts:
rockchip: Fix rk3399-gru-* s2r (pinctrl hogs, wifi reset)
9b9988414f44 irqchip/gic-v3: Allow LPIs to be disabled from the command line
d5e06c686858 iommu/rockchip: Perform a reset on shutdown
55b36a99626f drm/rockchip: Don't use spin_lock_irqsave in interrupt context
7d72d7e57c2c drm/rockchip: Do not use memcpy for MMIO addresses
236afcd0425c drm/rockchip: Clear all interrupts before requesting the IRQ
31608ae0d3fc arm64: Enable dynamic sched_domain flag setting
1b255643cdc3 drivers/base/arch_topology: Dynamic sched_domain flag detection
2caca1b31f89 arm64: rk3399: Add capacity-dmips-mhz attributes
36ced612e4d3 mfd: cros_ec: add RTC as mfd subdevice
4e95dc697ec6 spi: rockchip: Convert to late and early system PM callbacks
d27d6da92086 drm/rockchip: analogix_dp: Ensure that the bridge is
powered before poking it
bcce86412ec1 arm64: DT: rk3399: Add missing EDP clock
c04436bd57b8 bootloader cmdline
bb8a4d168d58 build hacks
26e04c84de6c kevin: build stuff
6915015e30db kevin: defconfig
4a3928c6f8a5 (tag: v4.16-rc3) Linux 4.16-rc3


Actually, I also see some errors reading thermal channels, but I
wonder if perhaps USB is causing some sort of interrupt storm and that
happens to randomly take out whatever device was trying to talk at the
same time?

[   83.718999] read channel() error: -110
[   83.723275] thermal thermal_zone3: failed to read out thermal zone (-110)
[   83.822810] read channel() error: -110
[   83.827048] thermal thermal_zone2: failed to read out thermal zone (-110)
[   84.862810] read channel() error: -110
[   84.867051] thermal thermal_zone3: failed to read out thermal zone (-110)
[   84.990800] read channel() error: -110
[   84.995034] thermal thermal_zone2: failed to read out thermal zone (-110)
[   86.110817] read channel() error: -110
[   86.115259] thermal thermal_zone3: failed to read out thermal zone (-110)
[   86.214990] read channel() error: -110
[   86.219440] thermal thermal_zone2: failed to read out thermal zone (-110)
[   87.295049] read channel() error: -110
[   87.299316] thermal thermal_zone3: failed to read out thermal zone (-110)
[   87.398805] read channel() error: -110
[   87.403040] thermal thermal_zone2: failed to read out thermal zone (-110)
[   88.510808] read channel() error: -110
[   88.515053] thermal thermal_zone2: failed to read out thermal zone (-110)
[   88.646810] read channel() error: -110
[   88.651250] thermal thermal_zone3: failed to read out thermal zone (-110)
[   89.874606] xhci-hcd xhci-hcd.2.auto: Abort failed to stop command ring: -110
[   89.896134] xhci-hcd xhci-hcd.2.auto: xHCI host controller not
responding, assume dead
[   89.905130] xhci-hcd xhci-hcd.2.auto: HC died; cleaning up
[   89.911491] xhci-hcd xhci-hcd.2.auto: Timeout while waiting for
configure endpoint command


> I use my kevin as a "real" laptop, which means it gets suspended/resumed
> at least 20 times a day, no reboots involved (unless I'm actually testing
> arm64 code on it).

Enric: I know you've been working with Kevin stuff a lot.  Any chance
you reproduce Marc's failures and also see that it's fixed with his
hack patch?

Marc: have you posted actual logs for the failing case (after picking
my dts fix) somewhere?


Unfortunately, I don't think I'll be able to devote a ton more time to
debugging this right now.  :(

-Doug
Heiko Stübner March 12, 2018, 1:31 p.m. UTC | #11
Am Mittwoch, 7. März 2018, 05:54:32 CET schrieb Doug Anderson:
> Hi,
> 
> On Tue, Mar 6, 2018 at 10:58 AM, Marc Zyngier <marc.zyngier@arm.com> wrote:
> > On 06/03/18 18:49, Heiko Stübner wrote:
> >> Am Dienstag, 6. März 2018, 19:15:18 CET schrieb Marc Zyngier:
> >>> Hi Doug,
> >>> 
> >>> On 06/03/18 18:00, Doug Anderson wrote:
> >>>> Hi,
> >>>> 
> >>>> On Tue, Mar 6, 2018 at 3:58 AM, Marc Zyngier <marc.zyngier@arm.com> 
wrote:
> >>>>> Hi all,
> >>>>> 
> >>>>> On 01/03/18 08:43, Heiko Stübner wrote:
> >>>>>> Am Dienstag, 27. Februar 2018, 21:47:11 CET schrieb Douglas Anderson:
> >>>>>>> Back in the early days when gru devices were still under development
> >>>>>>> we found an issue where the WiFi reset line needed to be configured
> >>>>>>> as
> >>>>>>> early as possible during the boot process to avoid the WiFi module
> >>>>>>> being in a bad state.
> >>>>>>> 
> >>>>>>> We found that the way to get the kernel to do this in the earliest
> >>>>>>> possible place was to configure this line in the pinctrl hogs, so
> >>>>>>> that's what we did.  For some history here you can see
> >>>>>>> <http://crosreview.com/368770>.  After the time that change landed
> >>>>>>> in
> >>>>>>> the kernel, we landed a firmware change to configure this line even
> >>>>>>> earlier.  See <http://crosreview.com/399919>.  However, even after
> >>>>>>> the
> >>>>>>> firmware change landed we kept the kernel change to deal with the
> >>>>>>> fact
> >>>>>>> that some people working on devices might take a little while to
> >>>>>>> update their firmware.
> >>>>>>> 
> >>>>>>> At this there are definitely zero devices out in the wild that have
> >>>>>>> firmware without the fix in it.  Specifically looking in the
> >>>>>>> firmware
> >>>>>>> branch several critically important fixes for memory stability
> >>>>>>> landed
> >>>>>>> after the patch in coreboot and I know we didn't ship without those.
> >>>>>>> Thus, by now, everyone should have the new firmware and it's safe to
> >>>>>>> not have the kernel set this up in a pinctrl hog.
> >>>>>>> 
> >>>>>>> Historically, even though it wasn't needed to have this in a pinctrl
> >>>>>>> hog, we still kept it since it didn't hurt.  Pinctrl would apply the
> >>>>>>> default hog at bootup and then would never touch things again.  That
> >>>>>>> all changed with commit 981ed1bfbc6c ("pinctrl: Really force states
> >>>>>>> during suspend/resume").  After that commit then we'll re-apply the
> >>>>>>> default hog at resume time and that can screw up the reset state of
> >>>>>>> WiFi.  ...and on rk3399 if you touch a device on PCIe in the wrong
> >>>>>>> way
> >>>>>>> then the whole system can go haywire.  That's what was happening.
> >>>>>>> Specifically you'd resume a rk3399-gru-* device and it would mostly
> >>>>>>> resume, then would crash with some crazy weird crash.
> >>>>>>> 
> >>>>>>> One could say, perhaps, that the recent pinctrl change was at fault
> >>>>>>> (and should be fixed) since it changed behavior.  ...but that's not
> >>>>>>> really true.  The device tree for rk3399-gru is really to blame.
> >>>>>>> Specifically since the pinctrl is defined in the hog and not in the
> >>>>>>> "wlan-pd-n" node then the actual user of this pin doesn't have a
> >>>>>>> pinctrl entry for it.  That's bad.
> >>>>>>> 
> >>>>>>> Let's fix our problems by just moving the control of
> >>>>>>> "wlan_module_reset_l pinctrl" out of the hog and put them in the
> >>>>>>> proper place.
> >>>>>>> 
> >>>>>>> NOTE: in theory, I think it should actually be possible to have a
> >>>>>>> pin
> >>>>>>> controlled _both_ by the hog and by an actual device.  Once the
> >>>>>>> device
> >>>>>>> claims the pin I think the hog is supposed to let go.  I'm not 100%
> >>>>>>> sure that this works and in any case this solution would be more
> >>>>>>> complex than is necessary.
> >>>>>>> 
> >>>>>>> Reported-by: Marc Zyngier <marc.zyngier@arm.com>
> >>>>>>> Fixes: 48f4d9796d99 ("arm64: dts: rockchip: add Gru/Kevin DTS")
> >>>>>>> Fixes: 981ed1bfbc6c ("pinctrl: Really force states during
> >>>>>>> suspend/resume")
> >>>>>>> Signed-off-by: Douglas Anderson <dianders@chromium.org>
> >>>>>> 
> >>>>>> applied as fix for 4.16 with the 2 Tested-tags
> >>>>> 
> >>>>> Sorry to rain on everyone's parade, but further testing shows that
> >>>>> this
> >>>>> patch may not be enough to restore a reliable s2r. My initial testing
> >>>>> did show that we were resuming without the VOP errors, but there seem
> >>>>> to
> >>>>> be further issues (I'm loosing the keyboard and the trackpad after
> >>>>> resume on Kevin).
> >>>>> 
> >>>>> Applying my initial hack makes it work again. I suspect that there are
> >>>>> more hog pins that need tweaking, but I'm a bit out of my depth here.
> >>>> 
> >>>> Are you positive you weren't just wearing your lucky hat when you
> >>>> tested your patch and then took it off when you tested mine?  As far
> >>> 
> >>> So far, I seem to have a 100% success rate in resuming with my silly
> >>> hack, whist your DT patch alone only gives me a 50% rate at best.
> >>> 
> >>>> as I can see the only hogs left on kevin are:
> >>>>   &ap_pwroff      /* AP will auto-assert this when in S3 */
> >>>>   &clk_32k        /* This pin is always 32k on gru boards */
> >>>> 
> >>>> Those map to:
> >>>>   ap_pwroff: ap-pwroff {
> >>>>   
> >>>>      rockchip,pins = <1 5 RK_FUNC_1 &pcfg_pull_none>;
> >>>>   
> >>>>   };
> >>>>   
> >>>>   clk_32k: clk-32k {
> >>>>   
> >>>>     rockchip,pins = <0 0 RK_FUNC_2 &pcfg_pull_none>;
> >>>>   
> >>>>   };
> >>>> 
> >>>> So I added some printouts at suspend/resume time.  Specifically I set
> >>>> a boolean to "true" for the duration rockchip_pinctrl_suspend() and
> >>>> rockchip_pinctrl_resume() and this turned on a printout in
> >>>> rockchip_set_mux().  My printout looked like this (yeah, I know it's a
> >>>> whitespace-damaged patch just to show what I'm doing):
> >>>> 
> >>>> +       regmap_read(regmap, reg, &before);
> >>>> 
> >>>>         data = (mask << (bit + 16));
> >>>>         rmask = data | (data >> 16);
> >>>>         data |= (mux & mask) << bit;
> >>>>         ret = regmap_update_bits(regmap, reg, rmask, data);
> >>>> 
> >>>> +       regmap_read(regmap, reg, &after);
> >>>> +
> >>>> +       if (DOUG) {
> >>>> +               dev_info(info->dev,
> >>>> +                        "setting mux of GPIO%d-%d to %d;
> >>>> %#010x=>%#010x\n", +                        bank->bank_num, pin, mux,
> >>>> reg, before, after); +       }
> >>>> 
> >>>> ...and a similar one in rockchip_set_pull().  That showed this at
> >>>> resume
> >>>> time:
> >>>> 
> >>>> [   62.284427] rockchip-pinctrl pinctrl: setting mux of GPIO1-5 to 1;
> >>>> 0x00009400=>0x00009400
> >>>> [   62.294423] rockchip-pinctrl pinctrl: setting pull of GPIO1-5;
> >>>> 0x000041aa=>0x000041aa
> >>>> [   62.303343] rockchip-pinctrl pinctrl: setting mux of GPIO0-0 to 2;
> >>>> 0x00005002=>0x00005002
> >>>> [   62.313240] rockchip-pinctrl pinctrl: setting pull of GPIO0-0;
> >>>> 0x00000ddc=>0x00000ddc
> >>>> [
> >>>> 
> >>>> Said another way: pinmux and pull isn't actually changing due to the
> >>>> hogs.  We can see if something else could be changing, but I'd really
> >>>> want to be sure you're certain that the hogs are causing you
> >>>> problems...
> >>> 
> >>> I cannot say for sure that the hogs are the issue. But I thought that
> >>> they were the only pins affected by 981ed1bfbc6c... If this patch can
> >>> affect other pins, then I'm probably barking up the wrong tree.
> >> 
> >> On Kevin I see something like
> >> 
> >> [   60.764129] cros-ec-spi spi2.0: spi transfer failed: -108
> >> [   60.764132] cros-ec-spi spi2.0: cs-deassert spi transfer failed: -108
> >> [   60.764136] cros-ec-spi spi2.0: Command xfer error (err:-108)
> >> [   60.764365] cros-ec-spi spi2.0: spi transfer failed: -108
> >> [   60.764368] cros-ec-spi spi2.0: cs-deassert spi transfer failed: -108
> >> [   60.764371] cros-ec-spi spi2.0: Command xfer error (err:-108)
> >> 
> >> on resume with my current for-next. So maybe your hack just
> >> happened to change some timing during resume?
> > 
> > No, I carry yet another patch to make that one work[1].
> > 
> >> Suspend/Resume also disconnects my usb-ethernet, making me lose my
> >> nfsroot, so I can test this once every boot only.
> 
> Yeah, it kills usb-ethernet for me too.  ...so I can suspend/resume
> once and then the next suspend fails with a bunch of usb errors.  This
> is based on your "hack/kevin-4.16", which looks like:
> 
> e7934b797f4b (HEAD, linux_arm-platforms/hack/kevin-4.16) arm64: dts:
> rockchip: Fix rk3399-gru-* s2r (pinctrl hogs, wifi reset)
> 9b9988414f44 irqchip/gic-v3: Allow LPIs to be disabled from the command line
> d5e06c686858 iommu/rockchip: Perform a reset on shutdown
> 55b36a99626f drm/rockchip: Don't use spin_lock_irqsave in interrupt context
> 7d72d7e57c2c drm/rockchip: Do not use memcpy for MMIO addresses
> 236afcd0425c drm/rockchip: Clear all interrupts before requesting the IRQ
> 31608ae0d3fc arm64: Enable dynamic sched_domain flag setting
> 1b255643cdc3 drivers/base/arch_topology: Dynamic sched_domain flag detection
> 2caca1b31f89 arm64: rk3399: Add capacity-dmips-mhz attributes
> 36ced612e4d3 mfd: cros_ec: add RTC as mfd subdevice
> 4e95dc697ec6 spi: rockchip: Convert to late and early system PM callbacks
> d27d6da92086 drm/rockchip: analogix_dp: Ensure that the bridge is
> powered before poking it
> bcce86412ec1 arm64: DT: rk3399: Add missing EDP clock
> c04436bd57b8 bootloader cmdline
> bb8a4d168d58 build hacks
> 26e04c84de6c kevin: build stuff
> 6915015e30db kevin: defconfig
> 4a3928c6f8a5 (tag: v4.16-rc3) Linux 4.16-rc3
> 
> 
> Actually, I also see some errors reading thermal channels, but I
> wonder if perhaps USB is causing some sort of interrupt storm and that
> happens to randomly take out whatever device was trying to talk at the
> same time?
> 
> [   83.718999] read channel() error: -110
> [   83.723275] thermal thermal_zone3: failed to read out thermal zone (-110)
> [   83.822810] read channel() error: -110
> [   83.827048] thermal thermal_zone2: failed to read out thermal zone (-110)
> [   84.862810] read channel() error: -110
> [   84.867051] thermal thermal_zone3: failed to read out thermal zone (-110)
> [   84.990800] read channel() error: -110
> [   84.995034] thermal thermal_zone2: failed to read out thermal zone (-110)
> [   86.110817] read channel() error: -110
> [   86.115259] thermal thermal_zone3: failed to read out thermal zone (-110)
> [   86.214990] read channel() error: -110
> [   86.219440] thermal thermal_zone2: failed to read out thermal zone (-110)
> [   87.295049] read channel() error: -110
> [   87.299316] thermal thermal_zone3: failed to read out thermal zone (-110)
> [   87.398805] read channel() error: -110
> [   87.403040] thermal thermal_zone2: failed to read out thermal zone (-110)
> [   88.510808] read channel() error: -110
> [   88.515053] thermal thermal_zone2: failed to read out thermal zone (-110)
> [   88.646810] read channel() error: -110
> [   88.651250] thermal thermal_zone3: failed to read out thermal zone (-110)
> [   89.874606] xhci-hcd xhci-hcd.2.auto: Abort failed to stop command ring:
> -110 [   89.896134] xhci-hcd xhci-hcd.2.auto: xHCI host controller not
> responding, assume dead
> [   89.905130] xhci-hcd xhci-hcd.2.auto: HC died; cleaning up
> [   89.911491] xhci-hcd xhci-hcd.2.auto: Timeout while waiting for
> configure endpoint command
> 
> > I use my kevin as a "real" laptop, which means it gets suspended/resumed
> > at least 20 times a day, no reboots involved (unless I'm actually testing
> > arm64 code on it).
> 
> Enric: I know you've been working with Kevin stuff a lot.  Any chance
> you reproduce Marc's failures and also see that it's fixed with his
> hack patch?
> 
> Marc: have you posted actual logs for the failing case (after picking
> my dts fix) somewhere?

just to interject, I'd guess that this is more an issue on Marc's branch?
I.e. I was testing the big hunk of analogix patches just now and did try
suspending as well and apart from the obvious fixes for spi / clk
the system suspends + resumes and I also get the keyboard back
in working condition after resume. Of course my nfsroot still dies
so I cannot test multiple suspend cycles, but at least once per boot
it works.

Test-branch in question is at
https://github.com/mmind/linux-rockchip/tree/tmp/testing_20180312

Base is my devel/upstream, which is master + my for-next + drmmisc
+ stuff that already went into some maintainer tree and looked necessary.


Heiko
diff mbox

Patch

diff --git a/arch/arm64/boot/dts/rockchip/rk3399-gru.dtsi b/arch/arm64/boot/dts/rockchip/rk3399-gru.dtsi
index 6e50768a34ce..9ad54751d0d8 100644
--- a/arch/arm64/boot/dts/rockchip/rk3399-gru.dtsi
+++ b/arch/arm64/boot/dts/rockchip/rk3399-gru.dtsi
@@ -406,8 +406,9 @@ 
 	wlan_pd_n: wlan-pd-n {
 		compatible = "regulator-fixed";
 		regulator-name = "wlan_pd_n";
+		pinctrl-names = "default";
+		pinctrl-0 = <&wlan_module_reset_l>;
 
-		/* Note the wlan_module_reset_l pinctrl */
 		enable-active-high;
 		gpio = <&gpio1 11 GPIO_ACTIVE_HIGH>;
 
@@ -988,12 +989,6 @@  ap_i2c_audio: &i2c8 {
 	pinctrl-0 = <
 		&ap_pwroff	/* AP will auto-assert this when in S3 */
 		&clk_32k	/* This pin is always 32k on gru boards */
-
-		/*
-		 * We want this driven low ASAP; firmware should help us, but
-		 * we can help ourselves too.
-		 */
-		&wlan_module_reset_l
 	>;
 
 	pcfg_output_low: pcfg-output-low {
@@ -1173,12 +1168,7 @@  ap_i2c_audio: &i2c8 {
 		};
 
 		wlan_module_reset_l: wlan-module-reset-l {
-			/*
-			 * We want this driven low ASAP (As {Soon,Strongly} As
-			 * Possible), to avoid leakage through the powered-down
-			 * WiFi.
-			 */
-			rockchip,pins = <1 11 RK_FUNC_GPIO &pcfg_output_low>;
+			rockchip,pins = <1 11 RK_FUNC_GPIO &pcfg_pull_none>;
 		};
 
 		bt_host_wake_l: bt-host-wake-l {