diff mbox series

[v2] ALSA: hda/i915 - avoid hung task timeout in i915 wait

Message ID 20220308172759.920551-1-kai.vehmanen@linux.intel.com (mailing list archive)
State New, archived
Headers show
Series [v2] ALSA: hda/i915 - avoid hung task timeout in i915 wait | expand

Commit Message

Kai Vehmanen March 8, 2022, 5:27 p.m. UTC
If kernel is built with hung task detection enabled and
CONFIG_DEFAULT_HUNG_TASK_TIMEOUT set to less than 60 seconds,
snd_hdac_i915_init() will trigger the hung task timeout in case i915 is
not available and taint the kernel.

Split the 60sec wait into a loop of smaller waits to avoid this.

Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com>
Co-developed-by: Ramalingam C <ramalingam.c@intel.com>
Signed-off-by: Ramalingam C <ramalingam.c@intel.com>
Signed-off-by: Kai Vehmanen <kai.vehmanen@linux.intel.com>
---
 sound/hda/hdac_i915.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

Changes V1->V2:
 - address local variable naming issue raised by Amadeusz
   and use Takashi's proposal


base-commit: fd7698cf0858f8c5e659b655109cd93c2f15cdd3

Comments

Tvrtko Ursulin March 9, 2022, 8:36 a.m. UTC | #1
On 08/03/2022 17:27, Kai Vehmanen wrote:
> If kernel is built with hung task detection enabled and
> CONFIG_DEFAULT_HUNG_TASK_TIMEOUT set to less than 60 seconds,
> snd_hdac_i915_init() will trigger the hung task timeout in case i915 is
> not available and taint the kernel.
> 
> Split the 60sec wait into a loop of smaller waits to avoid this.
> 
> Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com>
> Co-developed-by: Ramalingam C <ramalingam.c@intel.com>
> Signed-off-by: Ramalingam C <ramalingam.c@intel.com>
> Signed-off-by: Kai Vehmanen <kai.vehmanen@linux.intel.com>
> ---
>   sound/hda/hdac_i915.c | 10 ++++++----
>   1 file changed, 6 insertions(+), 4 deletions(-)
> 
> Changes V1->V2:
>   - address local variable naming issue raised by Amadeusz
>     and use Takashi's proposal
> 
> diff --git a/sound/hda/hdac_i915.c b/sound/hda/hdac_i915.c
> index 454474ac5716..aa48bed7baf7 100644
> --- a/sound/hda/hdac_i915.c
> +++ b/sound/hda/hdac_i915.c
> @@ -143,7 +143,7 @@ static bool i915_gfx_present(void)
>   int snd_hdac_i915_init(struct hdac_bus *bus)
>   {
>   	struct drm_audio_component *acomp;
> -	int err;
> +	int err, i;
>   
>   	if (!i915_gfx_present())
>   		return -ENODEV;
> @@ -159,9 +159,11 @@ int snd_hdac_i915_init(struct hdac_bus *bus)
>   	if (!acomp->ops) {
>   		if (!IS_ENABLED(CONFIG_MODULES) ||
>   		    !request_module("i915")) {
> -			/* 60s timeout */

Where does this 60s come from and why is the fix to work around 
DEFAULT_HUNG_TASK_TIMEOUT in a hacky way deemed okay? For instance would 
limiting the wait here to whatever the kconfig is set to be an option?

Regards,

Tvrtko

> -			wait_for_completion_timeout(&acomp->master_bind_complete,
> -						    msecs_to_jiffies(60 * 1000));
> +			/* max 60s timeout */
> +			for (i = 0; i < 60; i++)
> +				if (wait_for_completion_timeout(&acomp->master_bind_complete,
> +								msecs_to_jiffies(1000)))
> +					break;
>   		}
>   	}
>   	if (!acomp->ops) {
> 
> base-commit: fd7698cf0858f8c5e659b655109cd93c2f15cdd3
Kai Vehmanen March 9, 2022, 8:39 a.m. UTC | #2
Hi,

On Wed, 9 Mar 2022, Tvrtko Ursulin wrote:

> > -			/* 60s timeout */
> 
> Where does this 60s come from and why is the fix to work around
> DEFAULT_HUNG_TASK_TIMEOUT in a hacky way deemed okay? For instance would
> limiting the wait here to whatever the kconfig is set to be an option?

this was discussed in
https://lists.freedesktop.org/archives/intel-gfx/2022-February/290821.html
... and that thread concluded it's cleaner to split the wait than try
to figure out hung-task configuration from middle of audio driver.

The 60sec timeout comes from 2019 patch "ALSA: hda: Extend i915 component 
bind timeout" to fix an issue reported by Paul Menzel (cc'ed).

This patch keeps the timeout intact.
Br, Kai
Takashi Iwai March 9, 2022, 8:55 a.m. UTC | #3
On Wed, 09 Mar 2022 09:36:54 +0100,
Tvrtko Ursulin wrote:
> 
> 
> On 08/03/2022 17:27, Kai Vehmanen wrote:
> > If kernel is built with hung task detection enabled and
> > CONFIG_DEFAULT_HUNG_TASK_TIMEOUT set to less than 60 seconds,
> > snd_hdac_i915_init() will trigger the hung task timeout in case i915 is
> > not available and taint the kernel.
> >
> > Split the 60sec wait into a loop of smaller waits to avoid this.
> >
> > Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com>
> > Co-developed-by: Ramalingam C <ramalingam.c@intel.com>
> > Signed-off-by: Ramalingam C <ramalingam.c@intel.com>
> > Signed-off-by: Kai Vehmanen <kai.vehmanen@linux.intel.com>
> > ---
> >   sound/hda/hdac_i915.c | 10 ++++++----
> >   1 file changed, 6 insertions(+), 4 deletions(-)
> >
> > Changes V1->V2:
> >   - address local variable naming issue raised by Amadeusz
> >     and use Takashi's proposal
> >
> > diff --git a/sound/hda/hdac_i915.c b/sound/hda/hdac_i915.c
> > index 454474ac5716..aa48bed7baf7 100644
> > --- a/sound/hda/hdac_i915.c
> > +++ b/sound/hda/hdac_i915.c
> > @@ -143,7 +143,7 @@ static bool i915_gfx_present(void)
> >   int snd_hdac_i915_init(struct hdac_bus *bus)
> >   {
> >   	struct drm_audio_component *acomp;
> > -	int err;
> > +	int err, i;
> >     	if (!i915_gfx_present())
> >   		return -ENODEV;
> > @@ -159,9 +159,11 @@ int snd_hdac_i915_init(struct hdac_bus *bus)
> >   	if (!acomp->ops) {
> >   		if (!IS_ENABLED(CONFIG_MODULES) ||
> >   		    !request_module("i915")) {
> > -			/* 60s timeout */
> 
> Where does this 60s come from and why is the fix to work around
> DEFAULT_HUNG_TASK_TIMEOUT in a hacky way deemed okay?

The 60s timeout comes from the fact that the binding with i915 *might*
be mandatory for HD-audio driver on some platforms, while the binding
couldn't be achieved depending on the dynamic configuration change or
any other reasons, so we don't want to block forver.  And, basically
the hung check is false-positive, and if there is a better way to mark
to skip the hung check, we'd take it. But currently this seems to be
the easiest workaround for avoiding the false-positive checks.

> For instance
> would limiting the wait here to whatever the kconfig is set to be an
> option?

No, the hunk task timeout can be changed dynamically via procfs, hence
the fixed Kconfig won't help at all.


Takashi
Tvrtko Ursulin March 9, 2022, 9:02 a.m. UTC | #4
On 09/03/2022 08:39, Kai Vehmanen wrote:
> Hi,
> 
> On Wed, 9 Mar 2022, Tvrtko Ursulin wrote:
> 
>>> -			/* 60s timeout */
>>
>> Where does this 60s come from and why is the fix to work around
>> DEFAULT_HUNG_TASK_TIMEOUT in a hacky way deemed okay? For instance would
>> limiting the wait here to whatever the kconfig is set to be an option?
> 
> this was discussed in
> https://lists.freedesktop.org/archives/intel-gfx/2022-February/290821.html
> ... and that thread concluded it's cleaner to split the wait than try
> to figure out hung-task configuration from middle of audio driver.
> 
> The 60sec timeout comes from 2019 patch "ALSA: hda: Extend i915 component
> bind timeout" to fix an issue reported by Paul Menzel (cc'ed).
> 
> This patch keeps the timeout intact.

I did not spot discussion touching on the point I raised.

How about not fight the hung task detector but mark your wait context as 
"I really know what I'm doing - not stuck trust me". Maybe using 
wait_for_completion_killable_timeout would do it since 
snd_hdac_i915_init is allowed to fail with an error already?

Regards,

Tvrtko
Takashi Iwai March 9, 2022, 9:23 a.m. UTC | #5
On Wed, 09 Mar 2022 10:02:13 +0100,
Tvrtko Ursulin wrote:
> 
> 
> On 09/03/2022 08:39, Kai Vehmanen wrote:
> > Hi,
> >
> > On Wed, 9 Mar 2022, Tvrtko Ursulin wrote:
> >
> >>> -			/* 60s timeout */
> >>
> >> Where does this 60s come from and why is the fix to work around
> >> DEFAULT_HUNG_TASK_TIMEOUT in a hacky way deemed okay? For instance would
> >> limiting the wait here to whatever the kconfig is set to be an option?
> >
> > this was discussed in
> > https://lists.freedesktop.org/archives/intel-gfx/2022-February/290821.html
> > ... and that thread concluded it's cleaner to split the wait than try
> > to figure out hung-task configuration from middle of audio driver.
> >
> > The 60sec timeout comes from 2019 patch "ALSA: hda: Extend i915 component
> > bind timeout" to fix an issue reported by Paul Menzel (cc'ed).
> >
> > This patch keeps the timeout intact.
> 
> I did not spot discussion touching on the point I raised.
> 
> How about not fight the hung task detector but mark your wait context
> as "I really know what I'm doing - not stuck trust me".

The question is how often this problem hits.  Basically it's a very
corner case, and I even think we may leave as is; that's a matter of
configuration, and lowering such a bar should expect some
side-effect. OTOH, if the problem happens in many cases, it's
beneficial to fix in the core part, indeed.

> Maybe using
> wait_for_completion_killable_timeout would do it since
> snd_hdac_i915_init is allowed to fail with an error already?

It makes it killable -- which is a complete behavior change.


Takashi
Tvrtko Ursulin March 9, 2022, 9:48 a.m. UTC | #6
On 09/03/2022 09:23, Takashi Iwai wrote:
> On Wed, 09 Mar 2022 10:02:13 +0100,
> Tvrtko Ursulin wrote:
>>
>>
>> On 09/03/2022 08:39, Kai Vehmanen wrote:
>>> Hi,
>>>
>>> On Wed, 9 Mar 2022, Tvrtko Ursulin wrote:
>>>
>>>>> -			/* 60s timeout */
>>>>
>>>> Where does this 60s come from and why is the fix to work around
>>>> DEFAULT_HUNG_TASK_TIMEOUT in a hacky way deemed okay? For instance would
>>>> limiting the wait here to whatever the kconfig is set to be an option?
>>>
>>> this was discussed in
>>> https://lists.freedesktop.org/archives/intel-gfx/2022-February/290821.html
>>> ... and that thread concluded it's cleaner to split the wait than try
>>> to figure out hung-task configuration from middle of audio driver.
>>>
>>> The 60sec timeout comes from 2019 patch "ALSA: hda: Extend i915 component
>>> bind timeout" to fix an issue reported by Paul Menzel (cc'ed).
>>>
>>> This patch keeps the timeout intact.
>>
>> I did not spot discussion touching on the point I raised.
>>
>> How about not fight the hung task detector but mark your wait context
>> as "I really know what I'm doing - not stuck trust me".
> 
> The question is how often this problem hits.  Basically it's a very
> corner case, and I even think we may leave as is; that's a matter of
> configuration, and lowering such a bar should expect some
> side-effect. OTOH, if the problem happens in many cases, it's
> beneficial to fix in the core part, indeed.

Yes argument you raise can be made I agree.

>> Maybe using
>> wait_for_completion_killable_timeout would do it since
>> snd_hdac_i915_init is allowed to fail with an error already?
> 
> It makes it killable -- which is a complete behavior change.

Complete behaviour change how? Isn't this something ran on probe so 
likelihood of anyone sending SIGKILL to the modprobe process is only the 
init process? And in that case what is the fundamental difference in 
init giving up before the internal 60s in HDA driver does? I don't see a 
difference. Either party decided to abort the wait and code can just 
unwind and propagate the different error codes.

Regards,

Tvrtko
Takashi Iwai March 9, 2022, 9:55 a.m. UTC | #7
On Wed, 09 Mar 2022 10:48:49 +0100,
Tvrtko Ursulin wrote:
> 
> 
> On 09/03/2022 09:23, Takashi Iwai wrote:
> > On Wed, 09 Mar 2022 10:02:13 +0100,
> > Tvrtko Ursulin wrote:
> >>
> >>
> >> On 09/03/2022 08:39, Kai Vehmanen wrote:
> >>> Hi,
> >>>
> >>> On Wed, 9 Mar 2022, Tvrtko Ursulin wrote:
> >>>
> >>>>> -			/* 60s timeout */
> >>>>
> >>>> Where does this 60s come from and why is the fix to work around
> >>>> DEFAULT_HUNG_TASK_TIMEOUT in a hacky way deemed okay? For instance would
> >>>> limiting the wait here to whatever the kconfig is set to be an option?
> >>>
> >>> this was discussed in
> >>> https://lists.freedesktop.org/archives/intel-gfx/2022-February/290821.html
> >>> ... and that thread concluded it's cleaner to split the wait than try
> >>> to figure out hung-task configuration from middle of audio driver.
> >>>
> >>> The 60sec timeout comes from 2019 patch "ALSA: hda: Extend i915 component
> >>> bind timeout" to fix an issue reported by Paul Menzel (cc'ed).
> >>>
> >>> This patch keeps the timeout intact.
> >>
> >> I did not spot discussion touching on the point I raised.
> >>
> >> How about not fight the hung task detector but mark your wait context
> >> as "I really know what I'm doing - not stuck trust me".
> >
> > The question is how often this problem hits.  Basically it's a very
> > corner case, and I even think we may leave as is; that's a matter of
> > configuration, and lowering such a bar should expect some
> > side-effect. OTOH, if the problem happens in many cases, it's
> > beneficial to fix in the core part, indeed.
> 
> Yes argument you raise can be made I agree.
> 
> >> Maybe using
> >> wait_for_completion_killable_timeout would do it since
> >> snd_hdac_i915_init is allowed to fail with an error already?
> >
> > It makes it killable -- which is a complete behavior change.
> 
> Complete behaviour change how? Isn't this something ran on probe so
> likelihood of anyone sending SIGKILL to the modprobe process is only
> the init process? And in that case what is the fundamental difference
> in init giving up before the internal 60s in HDA driver does? I don't
> see a difference. Either party decided to abort the wait and code can
> just unwind and propagate the different error codes.

The point is that it does change the actual behavior, and changing the
actual behavior just for working around the corner case like the above
wouldn't be justified without the proper evaluation.

That said, if this behavior change is intentional and even desired,
that's a way to go.


Takashi
Kai Vehmanen March 9, 2022, 1:05 p.m. UTC | #8
Hi,

On Wed, 9 Mar 2022, Takashi Iwai wrote:

>> Takashi Iwai wrote:
>>> The question is how often this problem hits.  Basically it's a very
>>> corner case, and I even think we may leave as is; that's a matter of
>>> configuration, and lowering such a bar should expect some
>>> side-effect. OTOH, if the problem happens in many cases, it's
>>> beneficial to fix in the core part, indeed.

I'm basicly helping out the intel-gfx folks here. This is now happening 
systematically in the intel-gfx CI. The hung-task timeout is configured to 
30sec (in intel-gfx CI), and there's some new hw configs where this 
happens every time (I have a separate patch in progress [1] that tries 
to detect this case and skip the init, but this will require more time as there is 
risk of breaking existing configurations).

[1] 
https://lists.freedesktop.org/archives/intel-gfx-trybot/2022-February/128278.html

Tvrtko Ursulin wrote:
> Takashi Iwai wrote:
> > Complete behaviour change how? Isn't this something ran on probe so
> > likelihood of anyone sending SIGKILL to the modprobe process is only
> > the init process? And in that case what is the fundamental difference
>
[...]
> The point is that it does change the actual behavior, and changing the
> actual behavior just for working around the corner case like the above
> wouldn't be justified without the proper evaluation.
> 
> That said, if this behavior change is intentional and even desired,
> that's a way to go.

Let me try this out and test on a few configs (with and without the 
timeout occurring) and send a V3 if this seems ok.

Br, Kai
diff mbox series

Patch

diff --git a/sound/hda/hdac_i915.c b/sound/hda/hdac_i915.c
index 454474ac5716..aa48bed7baf7 100644
--- a/sound/hda/hdac_i915.c
+++ b/sound/hda/hdac_i915.c
@@ -143,7 +143,7 @@  static bool i915_gfx_present(void)
 int snd_hdac_i915_init(struct hdac_bus *bus)
 {
 	struct drm_audio_component *acomp;
-	int err;
+	int err, i;
 
 	if (!i915_gfx_present())
 		return -ENODEV;
@@ -159,9 +159,11 @@  int snd_hdac_i915_init(struct hdac_bus *bus)
 	if (!acomp->ops) {
 		if (!IS_ENABLED(CONFIG_MODULES) ||
 		    !request_module("i915")) {
-			/* 60s timeout */
-			wait_for_completion_timeout(&acomp->master_bind_complete,
-						    msecs_to_jiffies(60 * 1000));
+			/* max 60s timeout */
+			for (i = 0; i < 60; i++)
+				if (wait_for_completion_timeout(&acomp->master_bind_complete,
+								msecs_to_jiffies(1000)))
+					break;
 		}
 	}
 	if (!acomp->ops) {