[2/6] gpu: host1x: Fix syncpoint wait return value

Message ID	1368791388-31441-3-git-send-email-amerilainen@nvidia.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <dri-devel-bounces+patchwork-dri-devel=patchwork.kernel.org@lists.freedesktop.org> From: Arto Merilainen <amerilainen@nvidia.com> To: <thierry.reding@avionic-design.de>, <airlied@linux.ie>, <linux-tegra@vger.kernel.org> Subject: [PATCH 2/6] gpu: host1x: Fix syncpoint wait return value Date: Fri, 17 May 2013 14:49:44 +0300 Message-ID: <1368791388-31441-3-git-send-email-amerilainen@nvidia.com> In-Reply-To: <1368791388-31441-1-git-send-email-amerilainen@nvidia.com> References: <1368791388-31441-1-git-send-email-amerilainen@nvidia.com> MIME-Version: 1.0 Cc: tbergstrom@nvidia.com, linux-kernel@vger.kernel.org, dri-devel@lists.freedesktop.org, Arto Merilainen <amerilainen@nvidia.com> Precedence: list Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Sender: dri-devel-bounces+patchwork-dri-devel=patchwork.kernel.org@lists.freedesktop.org Errors-To: dri-devel-bounces+patchwork-dri-devel=patchwork.kernel.org@lists.freedesktop.org

Arto Merilainen May 17, 2013, 11:49 a.m. UTC

Syncpoint wait returned EAGAIN if it was called with zero timeout.
This patch modifies the function to return ETIMEDOUT.

Signed-off-by: Arto Merilainen <amerilainen@nvidia.com>
---
 drivers/gpu/host1x/syncpt.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

Thierry Reding May 26, 2013, 10:12 a.m. UTC | #1

On Fri, May 17, 2013 at 02:49:44PM +0300, Arto Merilainen wrote:
> Syncpoint wait returned EAGAIN if it was called with zero timeout.
> This patch modifies the function to return ETIMEDOUT.

This description is a bit redundant, because it repeats in prose what
the code does. I'd rather see a description of why the change is
necessary.

Thinking about it, maybe it would be good to have two separate error
codes. Keeping -EAGAIN for the case where a zero timeout was passed
doesn't sound too bad to differentiate it from the case where a non-
zero timeout was passed and it actually timed out. What do you think?

Thierry

Arto Merilainen May 27, 2013, 6:55 a.m. UTC | #2

On 05/26/2013 01:12 PM, Thierry Reding wrote:
> * PGP Signed by an unknown key
>
> On Fri, May 17, 2013 at 02:49:44PM +0300, Arto Merilainen wrote:
>> Syncpoint wait returned EAGAIN if it was called with zero timeout.
>> This patch modifies the function to return ETIMEDOUT.
>
> This description is a bit redundant, because it repeats in prose what
> the code does. I'd rather see a description of why the change is
> necessary.

True. Will fix.

>
> Thinking about it, maybe it would be good to have two separate error
> codes. Keeping -EAGAIN for the case where a zero timeout was passed
> doesn't sound too bad to differentiate it from the case where a non-
> zero timeout was passed and it actually timed out. What do you think?

I agree, in this case it would not look bad at all. However, user space 
libraries may loop until the ioctl return code is something else than 
-EAGAIN or -EINTR. Especially function drmIoctl() in libdrm does this 
which is why I noted this isssue in the first place.

If user space uses zero timeout to just check if a syncpoint value has 
already passed the library continues looping until the syncpoint value 
actually passes. Of course, we could just modify the ioctl interface to 
"cast" this return code to something else but that does not seem correct.

- Arto

Thierry Reding May 28, 2013, 10:39 a.m. UTC | #3

On Mon, May 27, 2013 at 09:55:46AM +0300, Arto Merilainen wrote:
> On 05/26/2013 01:12 PM, Thierry Reding wrote:
> >* PGP Signed by an unknown key
> >
> >On Fri, May 17, 2013 at 02:49:44PM +0300, Arto Merilainen wrote:
[...]
> >Thinking about it, maybe it would be good to have two separate error
> >codes. Keeping -EAGAIN for the case where a zero timeout was passed
> >doesn't sound too bad to differentiate it from the case where a non-
> >zero timeout was passed and it actually timed out. What do you think?
> 
> I agree, in this case it would not look bad at all. However, user
> space libraries may loop until the ioctl return code is something
> else than -EAGAIN or -EINTR. Especially function drmIoctl() in
> libdrm does this which is why I noted this isssue in the first
> place.
> 
> If user space uses zero timeout to just check if a syncpoint value
> has already passed the library continues looping until the syncpoint
> value actually passes. Of course, we could just modify the ioctl
> interface to "cast" this return code to something else but that does
> not seem correct.

That doesn't sound right. Maybe drmIoctl() needs fixing instead. Looking
at the history, drmIoctl() was introduced to automatically loop if a
signal was received (commit 8b9ab108ec1f2ba2b503f713769c4946849b3cb2).
However the ioctl(3p) manpage doesn't mention that ioctl() returns
EAGAIN in case it is interrupted by a signal.

I'm adding Keith as author of that commit and the xorg-devel mailing
list on Cc to get some more eyes on this.

Thierry

Keith Packard May 28, 2013, 7:12 p.m. UTC | #4

Thierry Reding <thierry.reding@gmail.com> writes:

> That doesn't sound right. Maybe drmIoctl() needs fixing instead. Looking
> at the history, drmIoctl() was introduced to automatically loop if a
> signal was received (commit 8b9ab108ec1f2ba2b503f713769c4946849b3cb2).
> However the ioctl(3p) manpage doesn't mention that ioctl() returns
> EAGAIN in case it is interrupted by a signal.

EAGAIN is being returned when the GPU is wedged to ask the application
to re-submit the request, which will presumably be held until the  GPU
is un-wedged.

Thierry Reding June 11, 2013, 10:48 a.m. UTC | #5

On Tue, May 28, 2013 at 01:12:12PM -0600, Keith Packard wrote:
> Thierry Reding <thierry.reding@gmail.com> writes:
> 
> 
> > That doesn't sound right. Maybe drmIoctl() needs fixing instead. Looking
> > at the history, drmIoctl() was introduced to automatically loop if a
> > signal was received (commit 8b9ab108ec1f2ba2b503f713769c4946849b3cb2).
> > However the ioctl(3p) manpage doesn't mention that ioctl() returns
> > EAGAIN in case it is interrupted by a signal.
> 
> EAGAIN is being returned when the GPU is wedged to ask the application
> to re-submit the request, which will presumably be held until the  GPU
> is un-wedged.

Isn't that a bit risky? What if something special needs to be done to
unwedge the GPU other than re-submit the request, or if it just can't
be reasonably unwedged. In that case drmIoctl() will keep looping
indefinitely.

If the above is indeed the expected behaviour for drivers, then we need
a different error code for the SYNCPT_WAIT ioctl. EAGAIN is the best fit
and anything else doesn't quite match the use-case. A different option
might be not to use drmIoctl() on Tegra.

Thierry

Daniel Vetter June 11, 2013, 11 a.m. UTC | #6

On Tue, Jun 11, 2013 at 12:48 PM, Thierry Reding
<thierry.reding@gmail.com> wrote:
> On Tue, May 28, 2013 at 01:12:12PM -0600, Keith Packard wrote:
>> Thierry Reding <thierry.reding@gmail.com> writes:
>>
>>
>> > That doesn't sound right. Maybe drmIoctl() needs fixing instead. Looking
>> > at the history, drmIoctl() was introduced to automatically loop if a
>> > signal was received (commit 8b9ab108ec1f2ba2b503f713769c4946849b3cb2).
>> > However the ioctl(3p) manpage doesn't mention that ioctl() returns
>> > EAGAIN in case it is interrupted by a signal.
>>
>> EAGAIN is being returned when the GPU is wedged to ask the application
>> to re-submit the request, which will presumably be held until the  GPU
>> is un-wedged.
>
> Isn't that a bit risky? What if something special needs to be done to
> unwedge the GPU other than re-submit the request, or if it just can't
> be reasonably unwedged. In that case drmIoctl() will keep looping
> indefinitely.
>
> If the above is indeed the expected behaviour for drivers, then we need
> a different error code for the SYNCPT_WAIT ioctl. EAGAIN is the best fit
> and anything else doesn't quite match the use-case. A different option
> might be not to use drmIoctl() on Tegra.

We don't use the EAGAIN ioctl restarting to resubmit the batchbuffer
which blew up the gpu (that one has been submitted already in a
different ioctl call), but to be able to restart the ioctl after the
reset has completed: We need to kick every thread which is potentially
holding GEM locks and make sure that we block them (at a point where
they don't hold any locks) until the reset handler completed. To avoid
a validation nightmare we use the same codepaths as we use for signal
interrupts, so ioctl restarting is a very natural fit for this.

Resubmitting victim workloads when a gpu crash happened is something
the reset handler would do (kernel work item currently), not any
userspace process doing an ioctl. But atm we don't resubmit victimized
workloads.
-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch

Terje Bergstrom June 11, 2013, 11:43 a.m. UTC | #7

On 11.06.2013 14:00, Daniel Vetter wrote:
> We don't use the EAGAIN ioctl restarting to resubmit the batchbuffer
> which blew up the gpu (that one has been submitted already in a
> different ioctl call), but to be able to restart the ioctl after the
> reset has completed: We need to kick every thread which is potentially
> holding GEM locks and make sure that we block them (at a point where
> they don't hold any locks) until the reset handler completed. To avoid
> a validation nightmare we use the same codepaths as we use for signal
> interrupts, so ioctl restarting is a very natural fit for this.
> 
> Resubmitting victim workloads when a gpu crash happened is something
> the reset handler would do (kernel work item currently), not any
> userspace process doing an ioctl. But atm we don't resubmit victimized
> workloads.

I don't understand the end-to-end of how resubmit is supposed to work.
User space is not supposed to resubmit, but still EAGAIN is returned to
user space, and drmIoctl() in user space just calls the came ioctl
again. Sounds like drmIoctl() is completely wrong.

In Tegra, when a job blows up, we reset the involved units, and set the
pushbuffer pointer of host1x to point to the next job, and re-enable
units. There's no need for anybody to resubmit anything, as kernel
already has them.

Terje

Daniel Vetter June 11, 2013, 12:09 p.m. UTC | #8

On Tue, Jun 11, 2013 at 1:43 PM, Terje Bergström <tbergstrom@nvidia.com> wrote:
> On 11.06.2013 14:00, Daniel Vetter wrote:
>> We don't use the EAGAIN ioctl restarting to resubmit the batchbuffer
>> which blew up the gpu (that one has been submitted already in a
>> different ioctl call), but to be able to restart the ioctl after the
>> reset has completed: We need to kick every thread which is potentially
>> holding GEM locks and make sure that we block them (at a point where
>> they don't hold any locks) until the reset handler completed. To avoid
>> a validation nightmare we use the same codepaths as we use for signal
>> interrupts, so ioctl restarting is a very natural fit for this.
>>
>> Resubmitting victim workloads when a gpu crash happened is something
>> the reset handler would do (kernel work item currently), not any
>> userspace process doing an ioctl. But atm we don't resubmit victimized
>> workloads.
>
> I don't understand the end-to-end of how resubmit is supposed to work.
> User space is not supposed to resubmit, but still EAGAIN is returned to
> user space, and drmIoctl() in user space just calls the came ioctl
> again. Sounds like drmIoctl() is completely wrong.

Maybe it wasn't clear, but -EAGAIN does _not_ resubmit work. -EAGAIN
is used to restart the ioctl if we had to kick a thread (to make sure
it doesn't hold any locks), e.g. for a blocking wait on oustanding
rendering. The codepaths taken work exactly as if the thread is
interrupt with a signal.

> In Tegra, when a job blows up, we reset the involved units, and set the
> pushbuffer pointer of host1x to point to the next job, and re-enable
> units. There's no need for anybody to resubmit anything, as kernel
> already has them.

Yeah, that's how it works in i915.ko, too.
-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch

Thierry Reding June 11, 2013, 6:19 p.m. UTC | #9

On Tue, Jun 11, 2013 at 02:09:31PM +0200, Daniel Vetter wrote:
> On Tue, Jun 11, 2013 at 1:43 PM, Terje Bergström <tbergstrom@nvidia.com> wrote:
> > On 11.06.2013 14:00, Daniel Vetter wrote:
> >> We don't use the EAGAIN ioctl restarting to resubmit the batchbuffer
> >> which blew up the gpu (that one has been submitted already in a
> >> different ioctl call), but to be able to restart the ioctl after the
> >> reset has completed: We need to kick every thread which is potentially
> >> holding GEM locks and make sure that we block them (at a point where
> >> they don't hold any locks) until the reset handler completed. To avoid
> >> a validation nightmare we use the same codepaths as we use for signal
> >> interrupts, so ioctl restarting is a very natural fit for this.
> >>
> >> Resubmitting victim workloads when a gpu crash happened is something
> >> the reset handler would do (kernel work item currently), not any
> >> userspace process doing an ioctl. But atm we don't resubmit victimized
> >> workloads.
> >
> > I don't understand the end-to-end of how resubmit is supposed to work.
> > User space is not supposed to resubmit, but still EAGAIN is returned to
> > user space, and drmIoctl() in user space just calls the came ioctl
> > again. Sounds like drmIoctl() is completely wrong.
> 
> Maybe it wasn't clear, but -EAGAIN does _not_ resubmit work. -EAGAIN
> is used to restart the ioctl if we had to kick a thread (to make sure
> it doesn't hold any locks), e.g. for a blocking wait on oustanding
> rendering. The codepaths taken work exactly as if the thread is
> interrupt with a signal.

Okay. So if I understand correctly that reserves EAGAIN for a very
specific purpose DRM-wide and therefore we can't really use it for
something Tegra-specific. Even if we wanted to side-step the issue
by not using drmIoctl(), it might be a bad idea (or even break in
some other path) if we use EAGAIN. I guess we'll have to find some
other error-code for the case where a zero timeout is given to the
syncpoint wait ioctl.

Maybe it's best to use ETIMEDOUT after, just like for the case of a
non-zero timeout and the syncpoint not being incremented in time.
Userspace should be able to differentiate between the two because it
knows what timeout it did pass to the ioctl.

Thierry

Terje Bergstrom June 12, 2013, 10:28 a.m. UTC | #10

On 11.06.2013 15:09, Daniel Vetter wrote:
> Maybe it wasn't clear, but -EAGAIN does _not_ resubmit work. -EAGAIN
> is used to restart the ioctl if we had to kick a thread (to make sure
> it doesn't hold any locks), e.g. for a blocking wait on oustanding
> rendering. The codepaths taken work exactly as if the thread is
> interrupt with a signal.

You did make it clear that there's no resubmission, but other parts
confused me.

So this is used so that a legacy driver which does not do fine-grained
locking can interrupt all waits for completion for a wedged submit. This
way a driver-wide lock get unlocked, cleanup code acquires locks, does
the magic to unwedge GPU, and unlocks. Then user space can re-submit the
waits as it got -EAGAIN.

Fortunately the error code doesn't really matter as long as it's not
EAGAIN, so we can just use ETIMEDOUT as Thierry suggested in his follow-up.

Terje

Daniel Vetter June 12, 2013, 11 a.m. UTC | #11

On Wed, Jun 12, 2013 at 12:28 PM, Terje Bergström <tbergstrom@nvidia.com> wrote:
> On 11.06.2013 15:09, Daniel Vetter wrote:
>> Maybe it wasn't clear, but -EAGAIN does _not_ resubmit work. -EAGAIN
>> is used to restart the ioctl if we had to kick a thread (to make sure
>> it doesn't hold any locks), e.g. for a blocking wait on oustanding
>> rendering. The codepaths taken work exactly as if the thread is
>> interrupt with a signal.
>
> You did make it clear that there's no resubmission, but other parts
> confused me.
>
> So this is used so that a legacy driver which does not do fine-grained
> locking can interrupt all waits for completion for a wedged submit. This
> way a driver-wide lock get unlocked, cleanup code acquires locks, does
> the magic to unwedge GPU, and unlocks. Then user space can re-submit the
> waits as it got -EAGAIN.

I think this is not just for drivers without fine-grained locking, at
least I expect that we'll keep the same mechanism when switching over
to per-object locking - we simply have too many places where a thread
could arbitrarily block while holding locks that the gpu reset handler
also needs to grab. You could of course restructure the code massively
and drop all locks while waiting, but that means adding tons of
special-purpose code which is only really exercises when a gpu hang
occurs. Our approach with the ioctl restart otoh reuses codepaths
which are all heavily used by X (due to X constantly getting interrupt
by timers and input events).
-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch

[2/6] gpu: host1x: Fix syncpoint wait return value

Commit Message

Comments

Patch