diff mbox

[RFC] drm/radeon: return 0 on successful gpu reset

Message ID 1355779884.1414.14.camel@x61.thuisdomein (mailing list archive)
State New, archived
Headers show

Commit Message

Paul Bolle Dec. 17, 2012, 9:31 p.m. UTC
On an (outdated) laptop the radeon driver (almost always) prints, during
the first resume of each session:
    [drm] crtc 1 is connected to a TV

This message is a bit puzzling as, as far as I know, no TV has ever
been connected to this laptop. Anyhow, before v3.5, if that happened the
radeon driver then printed an error during all following resumes:
    [drm:radeon_cs_ioctl] *ERROR* Failed to parse relocation -35!

(-35 is -EDEADLK.) But the resume would succeed and the driver seemed to
run without too much trouble. From v3.5 onwards things changed. If the
(puzzling) message about crtc 1 was printed on first resume the laptop
would simply hang on second resume. Only a manual power off would then
be possible. In that case nothing of interest would be found in the
(truncated) logs.  And, most annoyingly, the hang would never happen if
the laptop was booted with, say, "console=ttyS0,115200n8" added to the
kernel command line.

I bisected the hang to commit 6c6f478370eccfbfafbdc6fc55c0def03e58f124
("drm/radeon: rework recursive gpu reset handling"), which was added in
the v3.5 release cycle. After discovering that and poking at the driver
it turned out that this hang is triggered by radeon_cs_handle_lockup()
returning -EAGAIN after successfully resetting the gpu. Simply returning
0 makes the hang disappear (and makes the drm error reappear).

Nothing in the code or the commit explanation clarifies why -EAGAIN
should be returned on successful gpu reset. So I suggest
radeon_cs_handle_lockup() simply returns what radeon_gpu_reset()
returns, eg 0 (on success) or a negative error code (on failure).

Signed-off-by: Paul Bolle <pebolle@tiscali.nl>
---
0) This exact patch is untested (but I run something comparable).

1) Sent as an RFC because I do not understand why this laptop (almost
always) prints the "crtc 1" message on first resume. Note that another
workaround for this hang is simply booting with "radeon.tv=0".

2) Also sent as an RFC because I have no idea whatsoever why returning
-EAGAIN will hang the machine. I guess it's returned to userland by
radeon_cs_ioctl(). What code uses that ioctl? And what does that code do
on -EAGAIN that hangs this laptop?

3) A third reason to send this as an RFC is that I also have no idea why
this hang doesn't happen when booting with "console=ttyS0,115200n8" or
even "console=tty0"! But I guess I'm now allowed to call this hang a
Heisenbug.

 drivers/gpu/drm/radeon/radeon_cs.c |    5 +----
 1 files changed, 1 insertions(+), 4 deletions(-)

Comments

Christian König Dec. 18, 2012, 10:36 a.m. UTC | #1
On 17.12.2012 22:31, Paul Bolle wrote:
> On an (outdated) laptop the radeon driver (almost always) prints, during
> the first resume of each session:
>      [drm] crtc 1 is connected to a TV
>
> This message is a bit puzzling as, as far as I know, no TV has ever
> been connected to this laptop. Anyhow, before v3.5, if that happened the
> radeon driver then printed an error during all following resumes:
>      [drm:radeon_cs_ioctl] *ERROR* Failed to parse relocation -35!
>
> (-35 is -EDEADLK.) But the resume would succeed and the driver seemed to
> run without too much trouble. From v3.5 onwards things changed. If the
> (puzzling) message about crtc 1 was printed on first resume the laptop
> would simply hang on second resume. Only a manual power off would then
> be possible. In that case nothing of interest would be found in the
> (truncated) logs.  And, most annoyingly, the hang would never happen if
> the laptop was booted with, say, "console=ttyS0,115200n8" added to the
> kernel command line.
>
> I bisected the hang to commit 6c6f478370eccfbfafbdc6fc55c0def03e58f124
> ("drm/radeon: rework recursive gpu reset handling"), which was added in
> the v3.5 release cycle. After discovering that and poking at the driver
> it turned out that this hang is triggered by radeon_cs_handle_lockup()
> returning -EAGAIN after successfully resetting the gpu. Simply returning
> 0 makes the hang disappear (and makes the drm error reappear).
>
> Nothing in the code or the commit explanation clarifies why -EAGAIN
> should be returned on successful gpu reset. So I suggest
> radeon_cs_handle_lockup() simply returns what radeon_gpu_reset()
> returns, eg 0 (on success) or a negative error code (on failure).
>
> Signed-off-by: Paul Bolle <pebolle@tiscali.nl>
> ---
> 0) This exact patch is untested (but I run something comparable).
>
> 1) Sent as an RFC because I do not understand why this laptop (almost
> always) prints the "crtc 1" message on first resume. Note that another
> workaround for this hang is simply booting with "radeon.tv=0".
Alex should probably take a look into this, since he probably is the one 
with the deepest knowledge of the display engine. My best guess is that 
it is just some error while probing for an attached TV and actually 
isn't so bad after all.

> 2) Also sent as an RFC because I have no idea whatsoever why returning
> -EAGAIN will hang the machine. I guess it's returned to userland by
> radeon_cs_ioctl(). What code uses that ioctl? And what does that code do
> on -EAGAIN that hangs this laptop?

EAGAIN just tells userspace to reissue the requested system call. When a 
system call is interrupted (either by a signal or in our case a GPU 
lockup) it aborts and EGAIN is returned to userspace, telling userspace 
that it should try again. So by just returning 0 userspace things that 
our system call was executed and doesn't try it again.

So you just prevented the normal reissuing of the system call and so 
also prevented whatever this command submission should be doing in the 
first place.

> 3) A third reason to send this as an RFC is that I also have no idea why
> this hang doesn't happen when booting with "console=ttyS0,115200n8" or
> even "console=tty0"! But I guess I'm now allowed to call this hang a
> Heisenbug.
"Heisenbug" ? LOL, I need to remember that.

But anyway it is not so unusual seeing a bug like this, cause it is 
possible (but highly unlikely) that actually trying to print an error 
message can cause a lockup.

You should definitely try Alex latest drm-fixes-3.8 branch 
(git://people.freedesktop.org/~agd5f/linux) since the possibility is 
quite high that we already have fixed that bug. If that doesn't helps 
then please open a bug report and leave me a note so that I can 
investigate further.

Christian.


>
>   drivers/gpu/drm/radeon/radeon_cs.c |    5 +----
>   1 files changed, 1 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/gpu/drm/radeon/radeon_cs.c b/drivers/gpu/drm/radeon/radeon_cs.c
> index 41672cc..a302c00 100644
> --- a/drivers/gpu/drm/radeon/radeon_cs.c
> +++ b/drivers/gpu/drm/radeon/radeon_cs.c
> @@ -486,11 +486,8 @@ out:
>   
>   static int radeon_cs_handle_lockup(struct radeon_device *rdev, int r)
>   {
> -	if (r == -EDEADLK) {
> +	if (r == -EDEADLK)
>   		r = radeon_gpu_reset(rdev);
> -		if (!r)
> -			r = -EAGAIN;
> -	}
>   	return r;
>   }
>
Alex Deucher Dec. 18, 2012, 3:22 p.m. UTC | #2
On Tue, Dec 18, 2012 at 5:36 AM, Christian König
<deathsimple@vodafone.de> wrote:
> On 17.12.2012 22:31, Paul Bolle wrote:
>>
>> On an (outdated) laptop the radeon driver (almost always) prints, during
>> the first resume of each session:
>>      [drm] crtc 1 is connected to a TV
>>
>> This message is a bit puzzling as, as far as I know, no TV has ever
>> been connected to this laptop. Anyhow, before v3.5, if that happened the
>> radeon driver then printed an error during all following resumes:
>>      [drm:radeon_cs_ioctl] *ERROR* Failed to parse relocation -35!
>>
>> (-35 is -EDEADLK.) But the resume would succeed and the driver seemed to
>> run without too much trouble. From v3.5 onwards things changed. If the
>> (puzzling) message about crtc 1 was printed on first resume the laptop
>> would simply hang on second resume. Only a manual power off would then
>> be possible. In that case nothing of interest would be found in the
>> (truncated) logs.  And, most annoyingly, the hang would never happen if
>> the laptop was booted with, say, "console=ttyS0,115200n8" added to the
>> kernel command line.
>>
>> I bisected the hang to commit 6c6f478370eccfbfafbdc6fc55c0def03e58f124
>> ("drm/radeon: rework recursive gpu reset handling"), which was added in
>> the v3.5 release cycle. After discovering that and poking at the driver
>> it turned out that this hang is triggered by radeon_cs_handle_lockup()
>> returning -EAGAIN after successfully resetting the gpu. Simply returning
>> 0 makes the hang disappear (and makes the drm error reappear).
>>
>> Nothing in the code or the commit explanation clarifies why -EAGAIN
>> should be returned on successful gpu reset. So I suggest
>> radeon_cs_handle_lockup() simply returns what radeon_gpu_reset()
>> returns, eg 0 (on success) or a negative error code (on failure).
>>
>> Signed-off-by: Paul Bolle <pebolle@tiscali.nl>
>> ---
>> 0) This exact patch is untested (but I run something comparable).
>>
>> 1) Sent as an RFC because I do not understand why this laptop (almost
>> always) prints the "crtc 1" message on first resume. Note that another
>> workaround for this hang is simply booting with "radeon.tv=0".
>
> Alex should probably take a look into this, since he probably is the one
> with the deepest knowledge of the display engine. My best guess is that it
> is just some error while probing for an attached TV and actually isn't so
> bad after all.

TV detection is always a bit flakey.  I suspect some register or gpio
related to TV out doesn't get restored properly on resume which
results in a false positive.  What asic is this?

>
>
>> 2) Also sent as an RFC because I have no idea whatsoever why returning
>> -EAGAIN will hang the machine. I guess it's returned to userland by
>> radeon_cs_ioctl(). What code uses that ioctl? And what does that code do
>> on -EAGAIN that hangs this laptop?
>
>
> EAGAIN just tells userspace to reissue the requested system call. When a
> system call is interrupted (either by a signal or in our case a GPU lockup)
> it aborts and EGAIN is returned to userspace, telling userspace that it
> should try again. So by just returning 0 userspace things that our system
> call was executed and doesn't try it again.
>
> So you just prevented the normal reissuing of the system call and so also
> prevented whatever this command submission should be doing in the first
> place.
>
>
>> 3) A third reason to send this as an RFC is that I also have no idea why
>> this hang doesn't happen when booting with "console=ttyS0,115200n8" or
>> even "console=tty0"! But I guess I'm now allowed to call this hang a
>> Heisenbug.
>
> "Heisenbug" ? LOL, I need to remember that.
>
> But anyway it is not so unusual seeing a bug like this, cause it is possible
> (but highly unlikely) that actually trying to print an error message can
> cause a lockup.
>
> You should definitely try Alex latest drm-fixes-3.8 branch
> (git://people.freedesktop.org/~agd5f/linux) since the possibility is quite
> high that we already have fixed that bug. If that doesn't helps then please
> open a bug report and leave me a note so that I can investigate further.

I don't have a 3.8 fixes branch up just yet, but will soon.

Alex

>
> Christian.
>
>
>
>>
>>   drivers/gpu/drm/radeon/radeon_cs.c |    5 +----
>>   1 files changed, 1 insertions(+), 4 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/radeon/radeon_cs.c
>> b/drivers/gpu/drm/radeon/radeon_cs.c
>> index 41672cc..a302c00 100644
>> --- a/drivers/gpu/drm/radeon/radeon_cs.c
>> +++ b/drivers/gpu/drm/radeon/radeon_cs.c
>> @@ -486,11 +486,8 @@ out:
>>     static int radeon_cs_handle_lockup(struct radeon_device *rdev, int r)
>>   {
>> -       if (r == -EDEADLK) {
>> +       if (r == -EDEADLK)
>>                 r = radeon_gpu_reset(rdev);
>> -               if (!r)
>> -                       r = -EAGAIN;
>> -       }
>>         return r;
>>   }
>>
>
>
> _______________________________________________
> dri-devel mailing list
> dri-devel@lists.freedesktop.org
> http://lists.freedesktop.org/mailman/listinfo/dri-devel
Paul Bolle Dec. 18, 2012, 5:28 p.m. UTC | #3
On Tue, 2012-12-18 at 10:22 -0500, Alex Deucher wrote:
> On Tue, Dec 18, 2012 at 5:36 AM, Christian König
> <deathsimple@vodafone.de> wrote:
> > On 17.12.2012 22:31, Paul Bolle wrote:
> >> 1) Sent as an RFC because I do not understand why this laptop (almost
> >> always) prints the "crtc 1" message on first resume. Note that another
> >> workaround for this hang is simply booting with "radeon.tv=0".
> >
> > Alex should probably take a look into this, since he probably is the one
> > with the deepest knowledge of the display engine. My best guess is that it
> > is just some error while probing for an attached TV and actually isn't so
> > bad after all.
> 
> TV detection is always a bit flakey.  I suspect some register or gpio
> related to TV out doesn't get restored properly on resume which
> results in a false positive.  What asic is this?

It's an RV250, which makes the asic r200, doesn't it?

> > You should definitely try Alex latest drm-fixes-3.8 branch
> > (git://people.freedesktop.org/~agd5f/linux) since the possibility is quite
> > high that we already have fixed that bug. If that doesn't helps then please
> > open a bug report and leave me a note so that I can investigate further.
> 
> I don't have a 3.8 fixes branch up just yet, but will soon.

I see. I'll watch that repository, but I would appreciate it if you
notify me if that branch pops up.


Paul Bolle
Alex Deucher Dec. 18, 2012, 5:29 p.m. UTC | #4
On Tue, Dec 18, 2012 at 12:28 PM, Paul Bolle <pebolle@tiscali.nl> wrote:
> On Tue, 2012-12-18 at 10:22 -0500, Alex Deucher wrote:
>> On Tue, Dec 18, 2012 at 5:36 AM, Christian König
>> <deathsimple@vodafone.de> wrote:
>> > On 17.12.2012 22:31, Paul Bolle wrote:
>> >> 1) Sent as an RFC because I do not understand why this laptop (almost
>> >> always) prints the "crtc 1" message on first resume. Note that another
>> >> workaround for this hang is simply booting with "radeon.tv=0".
>> >
>> > Alex should probably take a look into this, since he probably is the one
>> > with the deepest knowledge of the display engine. My best guess is that it
>> > is just some error while probing for an attached TV and actually isn't so
>> > bad after all.
>>
>> TV detection is always a bit flakey.  I suspect some register or gpio
>> related to TV out doesn't get restored properly on resume which
>> results in a false positive.  What asic is this?
>
> It's an RV250, which makes the asic r200, doesn't it?

Yes, it's an r2xx variant.

>
>> > You should definitely try Alex latest drm-fixes-3.8 branch
>> > (git://people.freedesktop.org/~agd5f/linux) since the possibility is quite
>> > high that we already have fixed that bug. If that doesn't helps then please
>> > open a bug report and leave me a note so that I can investigate further.
>>
>> I don't have a 3.8 fixes branch up just yet, but will soon.
>
> I see. I'll watch that repository, but I would appreciate it if you
> notify me if that branch pops up.

Will do.

Alex
Alex Deucher Dec. 20, 2012, 6:37 p.m. UTC | #5
On Tue, Dec 18, 2012 at 12:28 PM, Paul Bolle <pebolle@tiscali.nl> wrote:
> On Tue, 2012-12-18 at 10:22 -0500, Alex Deucher wrote:
>> On Tue, Dec 18, 2012 at 5:36 AM, Christian König
>> <deathsimple@vodafone.de> wrote:
>> > On 17.12.2012 22:31, Paul Bolle wrote:
>> >> 1) Sent as an RFC because I do not understand why this laptop (almost
>> >> always) prints the "crtc 1" message on first resume. Note that another
>> >> workaround for this hang is simply booting with "radeon.tv=0".
>> >
>> > Alex should probably take a look into this, since he probably is the one
>> > with the deepest knowledge of the display engine. My best guess is that it
>> > is just some error while probing for an attached TV and actually isn't so
>> > bad after all.
>>
>> TV detection is always a bit flakey.  I suspect some register or gpio
>> related to TV out doesn't get restored properly on resume which
>> results in a false positive.  What asic is this?
>
> It's an RV250, which makes the asic r200, doesn't it?
>
>> > You should definitely try Alex latest drm-fixes-3.8 branch
>> > (git://people.freedesktop.org/~agd5f/linux) since the possibility is quite
>> > high that we already have fixed that bug. If that doesn't helps then please
>> > open a bug report and leave me a note so that I can investigate further.
>>
>> I don't have a 3.8 fixes branch up just yet, but will soon.
>
> I see. I'll watch that repository, but I would appreciate it if you
> notify me if that branch pops up.

http://cgit.freedesktop.org/~agd5f/linux/log/?h=drm-fixes-3.8

Alex

>
>
> Paul Bolle
>
Paul Bolle Dec. 21, 2012, 8:50 a.m. UTC | #6
On Thu, 2012-12-20 at 13:37 -0500, Alex Deucher wrote:
> On Tue, Dec 18, 2012 at 12:28 PM, Paul Bolle <pebolle@tiscali.nl> wrote:
> > On Tue, 2012-12-18 at 10:22 -0500, Alex Deucher wrote:
> >> On Tue, Dec 18, 2012 at 5:36 AM, Christian König
> >> <deathsimple@vodafone.de> wrote:
> >> > You should definitely try Alex latest drm-fixes-3.8 branch
> >> > (git://people.freedesktop.org/~agd5f/linux) since the possibility is quite
> >> > high that we already have fixed that bug. If that doesn't helps then please
> >> > open a bug report and leave me a note so that I can investigate further.
> >>
> >> I don't have a 3.8 fixes branch up just yet, but will soon.
> >
> > I see. I'll watch that repository, but I would appreciate it if you
> > notify me if that branch pops up.
> 
> http://cgit.freedesktop.org/~agd5f/linux/log/?h=drm-fixes-3.8

Thanks.

I tried drm-fixes-3.8, at commit
4613ca14b9739428abb53bef9cd0f8b3fee23a95 ("drm/radeon: add support for
MEM_WRITE packet"), on top of v3.7. It hangs in the exact same way.

Christian, were should I open a bug, and under which component, etc.?


Paul Bolle
diff mbox

Patch

diff --git a/drivers/gpu/drm/radeon/radeon_cs.c b/drivers/gpu/drm/radeon/radeon_cs.c
index 41672cc..a302c00 100644
--- a/drivers/gpu/drm/radeon/radeon_cs.c
+++ b/drivers/gpu/drm/radeon/radeon_cs.c
@@ -486,11 +486,8 @@  out:
 
 static int radeon_cs_handle_lockup(struct radeon_device *rdev, int r)
 {
-	if (r == -EDEADLK) {
+	if (r == -EDEADLK)
 		r = radeon_gpu_reset(rdev);
-		if (!r)
-			r = -EAGAIN;
-	}
 	return r;
 }