Message ID | 1368791388-31441-3-git-send-email-amerilainen@nvidia.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
On Fri, May 17, 2013 at 02:49:44PM +0300, Arto Merilainen wrote: > Syncpoint wait returned EAGAIN if it was called with zero timeout. > This patch modifies the function to return ETIMEDOUT. This description is a bit redundant, because it repeats in prose what the code does. I'd rather see a description of why the change is necessary. Thinking about it, maybe it would be good to have two separate error codes. Keeping -EAGAIN for the case where a zero timeout was passed doesn't sound too bad to differentiate it from the case where a non- zero timeout was passed and it actually timed out. What do you think? Thierry
On 05/26/2013 01:12 PM, Thierry Reding wrote: > * PGP Signed by an unknown key > > On Fri, May 17, 2013 at 02:49:44PM +0300, Arto Merilainen wrote: >> Syncpoint wait returned EAGAIN if it was called with zero timeout. >> This patch modifies the function to return ETIMEDOUT. > > This description is a bit redundant, because it repeats in prose what > the code does. I'd rather see a description of why the change is > necessary. True. Will fix. > > Thinking about it, maybe it would be good to have two separate error > codes. Keeping -EAGAIN for the case where a zero timeout was passed > doesn't sound too bad to differentiate it from the case where a non- > zero timeout was passed and it actually timed out. What do you think? I agree, in this case it would not look bad at all. However, user space libraries may loop until the ioctl return code is something else than -EAGAIN or -EINTR. Especially function drmIoctl() in libdrm does this which is why I noted this isssue in the first place. If user space uses zero timeout to just check if a syncpoint value has already passed the library continues looping until the syncpoint value actually passes. Of course, we could just modify the ioctl interface to "cast" this return code to something else but that does not seem correct. - Arto
On Mon, May 27, 2013 at 09:55:46AM +0300, Arto Merilainen wrote: > On 05/26/2013 01:12 PM, Thierry Reding wrote: > >* PGP Signed by an unknown key > > > >On Fri, May 17, 2013 at 02:49:44PM +0300, Arto Merilainen wrote: [...] > >Thinking about it, maybe it would be good to have two separate error > >codes. Keeping -EAGAIN for the case where a zero timeout was passed > >doesn't sound too bad to differentiate it from the case where a non- > >zero timeout was passed and it actually timed out. What do you think? > > I agree, in this case it would not look bad at all. However, user > space libraries may loop until the ioctl return code is something > else than -EAGAIN or -EINTR. Especially function drmIoctl() in > libdrm does this which is why I noted this isssue in the first > place. > > If user space uses zero timeout to just check if a syncpoint value > has already passed the library continues looping until the syncpoint > value actually passes. Of course, we could just modify the ioctl > interface to "cast" this return code to something else but that does > not seem correct. That doesn't sound right. Maybe drmIoctl() needs fixing instead. Looking at the history, drmIoctl() was introduced to automatically loop if a signal was received (commit 8b9ab108ec1f2ba2b503f713769c4946849b3cb2). However the ioctl(3p) manpage doesn't mention that ioctl() returns EAGAIN in case it is interrupted by a signal. I'm adding Keith as author of that commit and the xorg-devel mailing list on Cc to get some more eyes on this. Thierry
Thierry Reding <thierry.reding@gmail.com> writes: > That doesn't sound right. Maybe drmIoctl() needs fixing instead. Looking > at the history, drmIoctl() was introduced to automatically loop if a > signal was received (commit 8b9ab108ec1f2ba2b503f713769c4946849b3cb2). > However the ioctl(3p) manpage doesn't mention that ioctl() returns > EAGAIN in case it is interrupted by a signal. EAGAIN is being returned when the GPU is wedged to ask the application to re-submit the request, which will presumably be held until the GPU is un-wedged.
On Tue, May 28, 2013 at 01:12:12PM -0600, Keith Packard wrote: > Thierry Reding <thierry.reding@gmail.com> writes: > > > > That doesn't sound right. Maybe drmIoctl() needs fixing instead. Looking > > at the history, drmIoctl() was introduced to automatically loop if a > > signal was received (commit 8b9ab108ec1f2ba2b503f713769c4946849b3cb2). > > However the ioctl(3p) manpage doesn't mention that ioctl() returns > > EAGAIN in case it is interrupted by a signal. > > EAGAIN is being returned when the GPU is wedged to ask the application > to re-submit the request, which will presumably be held until the GPU > is un-wedged. Isn't that a bit risky? What if something special needs to be done to unwedge the GPU other than re-submit the request, or if it just can't be reasonably unwedged. In that case drmIoctl() will keep looping indefinitely. If the above is indeed the expected behaviour for drivers, then we need a different error code for the SYNCPT_WAIT ioctl. EAGAIN is the best fit and anything else doesn't quite match the use-case. A different option might be not to use drmIoctl() on Tegra. Thierry
On Tue, Jun 11, 2013 at 12:48 PM, Thierry Reding <thierry.reding@gmail.com> wrote: > On Tue, May 28, 2013 at 01:12:12PM -0600, Keith Packard wrote: >> Thierry Reding <thierry.reding@gmail.com> writes: >> >> >> > That doesn't sound right. Maybe drmIoctl() needs fixing instead. Looking >> > at the history, drmIoctl() was introduced to automatically loop if a >> > signal was received (commit 8b9ab108ec1f2ba2b503f713769c4946849b3cb2). >> > However the ioctl(3p) manpage doesn't mention that ioctl() returns >> > EAGAIN in case it is interrupted by a signal. >> >> EAGAIN is being returned when the GPU is wedged to ask the application >> to re-submit the request, which will presumably be held until the GPU >> is un-wedged. > > Isn't that a bit risky? What if something special needs to be done to > unwedge the GPU other than re-submit the request, or if it just can't > be reasonably unwedged. In that case drmIoctl() will keep looping > indefinitely. > > If the above is indeed the expected behaviour for drivers, then we need > a different error code for the SYNCPT_WAIT ioctl. EAGAIN is the best fit > and anything else doesn't quite match the use-case. A different option > might be not to use drmIoctl() on Tegra. We don't use the EAGAIN ioctl restarting to resubmit the batchbuffer which blew up the gpu (that one has been submitted already in a different ioctl call), but to be able to restart the ioctl after the reset has completed: We need to kick every thread which is potentially holding GEM locks and make sure that we block them (at a point where they don't hold any locks) until the reset handler completed. To avoid a validation nightmare we use the same codepaths as we use for signal interrupts, so ioctl restarting is a very natural fit for this. Resubmitting victim workloads when a gpu crash happened is something the reset handler would do (kernel work item currently), not any userspace process doing an ioctl. But atm we don't resubmit victimized workloads. -Daniel -- Daniel Vetter Software Engineer, Intel Corporation +41 (0) 79 365 57 48 - http://blog.ffwll.ch
On 11.06.2013 14:00, Daniel Vetter wrote: > We don't use the EAGAIN ioctl restarting to resubmit the batchbuffer > which blew up the gpu (that one has been submitted already in a > different ioctl call), but to be able to restart the ioctl after the > reset has completed: We need to kick every thread which is potentially > holding GEM locks and make sure that we block them (at a point where > they don't hold any locks) until the reset handler completed. To avoid > a validation nightmare we use the same codepaths as we use for signal > interrupts, so ioctl restarting is a very natural fit for this. > > Resubmitting victim workloads when a gpu crash happened is something > the reset handler would do (kernel work item currently), not any > userspace process doing an ioctl. But atm we don't resubmit victimized > workloads. I don't understand the end-to-end of how resubmit is supposed to work. User space is not supposed to resubmit, but still EAGAIN is returned to user space, and drmIoctl() in user space just calls the came ioctl again. Sounds like drmIoctl() is completely wrong. In Tegra, when a job blows up, we reset the involved units, and set the pushbuffer pointer of host1x to point to the next job, and re-enable units. There's no need for anybody to resubmit anything, as kernel already has them. Terje
On Tue, Jun 11, 2013 at 1:43 PM, Terje Bergström <tbergstrom@nvidia.com> wrote: > On 11.06.2013 14:00, Daniel Vetter wrote: >> We don't use the EAGAIN ioctl restarting to resubmit the batchbuffer >> which blew up the gpu (that one has been submitted already in a >> different ioctl call), but to be able to restart the ioctl after the >> reset has completed: We need to kick every thread which is potentially >> holding GEM locks and make sure that we block them (at a point where >> they don't hold any locks) until the reset handler completed. To avoid >> a validation nightmare we use the same codepaths as we use for signal >> interrupts, so ioctl restarting is a very natural fit for this. >> >> Resubmitting victim workloads when a gpu crash happened is something >> the reset handler would do (kernel work item currently), not any >> userspace process doing an ioctl. But atm we don't resubmit victimized >> workloads. > > I don't understand the end-to-end of how resubmit is supposed to work. > User space is not supposed to resubmit, but still EAGAIN is returned to > user space, and drmIoctl() in user space just calls the came ioctl > again. Sounds like drmIoctl() is completely wrong. Maybe it wasn't clear, but -EAGAIN does _not_ resubmit work. -EAGAIN is used to restart the ioctl if we had to kick a thread (to make sure it doesn't hold any locks), e.g. for a blocking wait on oustanding rendering. The codepaths taken work exactly as if the thread is interrupt with a signal. > In Tegra, when a job blows up, we reset the involved units, and set the > pushbuffer pointer of host1x to point to the next job, and re-enable > units. There's no need for anybody to resubmit anything, as kernel > already has them. Yeah, that's how it works in i915.ko, too. -Daniel -- Daniel Vetter Software Engineer, Intel Corporation +41 (0) 79 365 57 48 - http://blog.ffwll.ch
On Tue, Jun 11, 2013 at 02:09:31PM +0200, Daniel Vetter wrote: > On Tue, Jun 11, 2013 at 1:43 PM, Terje Bergström <tbergstrom@nvidia.com> wrote: > > On 11.06.2013 14:00, Daniel Vetter wrote: > >> We don't use the EAGAIN ioctl restarting to resubmit the batchbuffer > >> which blew up the gpu (that one has been submitted already in a > >> different ioctl call), but to be able to restart the ioctl after the > >> reset has completed: We need to kick every thread which is potentially > >> holding GEM locks and make sure that we block them (at a point where > >> they don't hold any locks) until the reset handler completed. To avoid > >> a validation nightmare we use the same codepaths as we use for signal > >> interrupts, so ioctl restarting is a very natural fit for this. > >> > >> Resubmitting victim workloads when a gpu crash happened is something > >> the reset handler would do (kernel work item currently), not any > >> userspace process doing an ioctl. But atm we don't resubmit victimized > >> workloads. > > > > I don't understand the end-to-end of how resubmit is supposed to work. > > User space is not supposed to resubmit, but still EAGAIN is returned to > > user space, and drmIoctl() in user space just calls the came ioctl > > again. Sounds like drmIoctl() is completely wrong. > > Maybe it wasn't clear, but -EAGAIN does _not_ resubmit work. -EAGAIN > is used to restart the ioctl if we had to kick a thread (to make sure > it doesn't hold any locks), e.g. for a blocking wait on oustanding > rendering. The codepaths taken work exactly as if the thread is > interrupt with a signal. Okay. So if I understand correctly that reserves EAGAIN for a very specific purpose DRM-wide and therefore we can't really use it for something Tegra-specific. Even if we wanted to side-step the issue by not using drmIoctl(), it might be a bad idea (or even break in some other path) if we use EAGAIN. I guess we'll have to find some other error-code for the case where a zero timeout is given to the syncpoint wait ioctl. Maybe it's best to use ETIMEDOUT after, just like for the case of a non-zero timeout and the syncpoint not being incremented in time. Userspace should be able to differentiate between the two because it knows what timeout it did pass to the ioctl. Thierry
On 11.06.2013 15:09, Daniel Vetter wrote: > Maybe it wasn't clear, but -EAGAIN does _not_ resubmit work. -EAGAIN > is used to restart the ioctl if we had to kick a thread (to make sure > it doesn't hold any locks), e.g. for a blocking wait on oustanding > rendering. The codepaths taken work exactly as if the thread is > interrupt with a signal. You did make it clear that there's no resubmission, but other parts confused me. So this is used so that a legacy driver which does not do fine-grained locking can interrupt all waits for completion for a wedged submit. This way a driver-wide lock get unlocked, cleanup code acquires locks, does the magic to unwedge GPU, and unlocks. Then user space can re-submit the waits as it got -EAGAIN. Fortunately the error code doesn't really matter as long as it's not EAGAIN, so we can just use ETIMEDOUT as Thierry suggested in his follow-up. Terje
On Wed, Jun 12, 2013 at 12:28 PM, Terje Bergström <tbergstrom@nvidia.com> wrote: > On 11.06.2013 15:09, Daniel Vetter wrote: >> Maybe it wasn't clear, but -EAGAIN does _not_ resubmit work. -EAGAIN >> is used to restart the ioctl if we had to kick a thread (to make sure >> it doesn't hold any locks), e.g. for a blocking wait on oustanding >> rendering. The codepaths taken work exactly as if the thread is >> interrupt with a signal. > > You did make it clear that there's no resubmission, but other parts > confused me. > > So this is used so that a legacy driver which does not do fine-grained > locking can interrupt all waits for completion for a wedged submit. This > way a driver-wide lock get unlocked, cleanup code acquires locks, does > the magic to unwedge GPU, and unlocks. Then user space can re-submit the > waits as it got -EAGAIN. I think this is not just for drivers without fine-grained locking, at least I expect that we'll keep the same mechanism when switching over to per-object locking - we simply have too many places where a thread could arbitrarily block while holding locks that the gpu reset handler also needs to grab. You could of course restructure the code massively and drop all locks while waiting, but that means adding tons of special-purpose code which is only really exercises when a gpu hang occurs. Our approach with the ioctl restart otoh reuses codepaths which are all heavily used by X (due to X constantly getting interrupt by timers and input events). -Daniel -- Daniel Vetter Software Engineer, Intel Corporation +41 (0) 79 365 57 48 - http://blog.ffwll.ch
diff --git a/drivers/gpu/host1x/syncpt.c b/drivers/gpu/host1x/syncpt.c index 4b49345..5bf5366 100644 --- a/drivers/gpu/host1x/syncpt.c +++ b/drivers/gpu/host1x/syncpt.c @@ -187,7 +187,7 @@ int host1x_syncpt_wait(struct host1x_syncpt *sp, u32 thresh, long timeout, } if (!timeout) { - err = -EAGAIN; + err = -ETIMEDOUT; goto done; } @@ -205,7 +205,7 @@ int host1x_syncpt_wait(struct host1x_syncpt *sp, u32 thresh, long timeout, if (err) goto done; - err = -EAGAIN; + err = -ETIMEDOUT; /* Caller-specified timeout may be impractically low */ if (timeout < 0) timeout = LONG_MAX;
Syncpoint wait returned EAGAIN if it was called with zero timeout. This patch modifies the function to return ETIMEDOUT. Signed-off-by: Arto Merilainen <amerilainen@nvidia.com> --- drivers/gpu/host1x/syncpt.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-)