Message ID | abba1962f03200b03c3deb16bd3e690393f4596e.1370992898.git.luto@amacapital.net (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
On Die, 2013-06-11 at 16:23 -0700, Andy Lutomirski wrote: > If the device is idle for over ten seconds, then the next attempt to do > anything can race with the lockup detector and cause a bogus lockup > to be detected. > > Oddly, the situation is well-described in the lockup detector's comments > and a fix is even described. This patch implements that fix (and corrects > some typos in the description). > > My system has been stable for about a week running this code. Without this, > my screen would go blank every now and then and, when it came back, everything > would be remarkably slow (the latter is a separate bug). > > Signed-off-by: Andy Lutomirski <luto@amacapital.net> [...] > diff --git a/drivers/gpu/drm/radeon/radeon_ring.c b/drivers/gpu/drm/radeon/radeon_ring.c > index 1ef5eaa..fb7b3ea 100644 > --- a/drivers/gpu/drm/radeon/radeon_ring.c > +++ b/drivers/gpu/drm/radeon/radeon_ring.c > @@ -547,12 +547,12 @@ void radeon_ring_lockup_update(struct radeon_ring *ring) > * have CP rptr to a different value of jiffies wrap around which will force > * initialization of the lockup tracking informations. > * > - * A possible false positivie is if we get call after while and last_cp_rptr == > - * the current CP rptr, even if it's unlikely it might happen. To avoid this > - * if the elapsed time since last call is bigger than 2 second than we return > - * false and update the tracking information. Due to this the caller must call > - * radeon_ring_test_lockup several time in less than 2sec for lockup to be reported > - * the fencing code should be cautious about that. > + * A possible false positive is if we get called after a while and > + * last_cp_rptr == the current CP rptr, even if it's unlikely it might > + * happen. To avoid this if the elapsed time since the last call is bigger > + * than 2 second then we return false and update the tracking > + * information. Due to this the caller must call radeon_ring_test_lockup > + * more frequently than once every 2s when waiting. Is it guaranteed that radeon_ring_test_lockup will be called more often than every 2s when waiting? If not, this change might prevent a real lockup from being detected? Either way, I wonder if there might not be a simpler solution to the problem, e.g. by updating last_activity when submitting commands to a previously empty ring.
Am 12.06.2013 12:26, schrieb Michel Dänzer: > On Die, 2013-06-11 at 16:23 -0700, Andy Lutomirski wrote: >> If the device is idle for over ten seconds, then the next attempt to do >> anything can race with the lockup detector and cause a bogus lockup >> to be detected. >> >> Oddly, the situation is well-described in the lockup detector's comments >> and a fix is even described. This patch implements that fix (and corrects >> some typos in the description). >> >> My system has been stable for about a week running this code. Without this, >> my screen would go blank every now and then and, when it came back, everything >> would be remarkably slow (the latter is a separate bug). >> >> Signed-off-by: Andy Lutomirski <luto@amacapital.net> > [...] > >> diff --git a/drivers/gpu/drm/radeon/radeon_ring.c b/drivers/gpu/drm/radeon/radeon_ring.c >> index 1ef5eaa..fb7b3ea 100644 >> --- a/drivers/gpu/drm/radeon/radeon_ring.c >> +++ b/drivers/gpu/drm/radeon/radeon_ring.c >> @@ -547,12 +547,12 @@ void radeon_ring_lockup_update(struct radeon_ring *ring) >> * have CP rptr to a different value of jiffies wrap around which will force >> * initialization of the lockup tracking informations. >> * >> - * A possible false positivie is if we get call after while and last_cp_rptr == >> - * the current CP rptr, even if it's unlikely it might happen. To avoid this >> - * if the elapsed time since last call is bigger than 2 second than we return >> - * false and update the tracking information. Due to this the caller must call >> - * radeon_ring_test_lockup several time in less than 2sec for lockup to be reported >> - * the fencing code should be cautious about that. >> + * A possible false positive is if we get called after a while and >> + * last_cp_rptr == the current CP rptr, even if it's unlikely it might >> + * happen. To avoid this if the elapsed time since the last call is bigger >> + * than 2 second then we return false and update the tracking >> + * information. Due to this the caller must call radeon_ring_test_lockup >> + * more frequently than once every 2s when waiting. > Is it guaranteed that radeon_ring_test_lockup will be called more often > than every 2s when waiting? If not, this change might prevent a real > lockup from being detected? > > Either way, I wonder if there might not be a simpler solution to the > problem, e.g. by updating last_activity when submitting commands to a > previously empty ring. If I remember correctly that's exactly what I used to do when creating radeon_ring_test_lockup, and now I'm really wondering why that stuff isn't there any more. Anyway the problem you describe should only happen very very rarely in case of a wraparound, so I'm pretty sure that we have a different problem that's just masked by that patch. Christian.
On Wed, Jun 12, 2013 at 6:26 AM, Michel Dänzer <michel@daenzer.net> wrote: > On Die, 2013-06-11 at 16:23 -0700, Andy Lutomirski wrote: >> If the device is idle for over ten seconds, then the next attempt to do >> anything can race with the lockup detector and cause a bogus lockup >> to be detected. >> >> Oddly, the situation is well-described in the lockup detector's comments >> and a fix is even described. This patch implements that fix (and corrects >> some typos in the description). >> >> My system has been stable for about a week running this code. Without this, >> my screen would go blank every now and then and, when it came back, everything >> would be remarkably slow (the latter is a separate bug). >> >> Signed-off-by: Andy Lutomirski <luto@amacapital.net> > > [...] > >> diff --git a/drivers/gpu/drm/radeon/radeon_ring.c b/drivers/gpu/drm/radeon/radeon_ring.c >> index 1ef5eaa..fb7b3ea 100644 >> --- a/drivers/gpu/drm/radeon/radeon_ring.c >> +++ b/drivers/gpu/drm/radeon/radeon_ring.c >> @@ -547,12 +547,12 @@ void radeon_ring_lockup_update(struct radeon_ring *ring) >> * have CP rptr to a different value of jiffies wrap around which will force >> * initialization of the lockup tracking informations. >> * >> - * A possible false positivie is if we get call after while and last_cp_rptr == >> - * the current CP rptr, even if it's unlikely it might happen. To avoid this >> - * if the elapsed time since last call is bigger than 2 second than we return >> - * false and update the tracking information. Due to this the caller must call >> - * radeon_ring_test_lockup several time in less than 2sec for lockup to be reported >> - * the fencing code should be cautious about that. >> + * A possible false positive is if we get called after a while and >> + * last_cp_rptr == the current CP rptr, even if it's unlikely it might >> + * happen. To avoid this if the elapsed time since the last call is bigger >> + * than 2 second then we return false and update the tracking >> + * information. Due to this the caller must call radeon_ring_test_lockup >> + * more frequently than once every 2s when waiting. > > Is it guaranteed that radeon_ring_test_lockup will be called more often > than every 2s when waiting? If not, this change might prevent a real > lockup from being detected? Yes it will if you wait for a fence, because the fence timeout wait is way smaller than 2sec so radeon_ring_is_lockup get call several time, which call radeon_ring_force_activity and then radeon_ring_test_lockup. This also means it very very very unlikely (see below for the likely case) to have a wrap around that give last rptr same as current one. The likely case is when you have something like a long compute, then nothing is lockup but you keep filling ring with radeon_ring_force_activity but the cp is still stuck on the ib of the compute stuff so rptr does not progress. > Either way, I wonder if there might not be a simpler solution to the > problem, e.g. by updating last_activity when submitting commands to a > previously empty ring. Maybe but i still don't think it should matter. Andy can you test (without your patch) and see if it helps with your issue : http://people.freedesktop.org/~glisse/0001-drm-radeon-update-lockup-tracking-when-scheduling-in.patch Cheers, Jerome
On Wed, Jun 12, 2013 at 6:56 AM, Jerome Glisse <j.glisse@gmail.com> wrote: > On Wed, Jun 12, 2013 at 6:26 AM, Michel Dänzer <michel@daenzer.net> wrote: >> On Die, 2013-06-11 at 16:23 -0700, Andy Lutomirski wrote: >>> If the device is idle for over ten seconds, then the next attempt to do >>> anything can race with the lockup detector and cause a bogus lockup >>> to be detected. >>> >>> Oddly, the situation is well-described in the lockup detector's comments >>> and a fix is even described. This patch implements that fix (and corrects >>> some typos in the description). >>> >>> My system has been stable for about a week running this code. Without this, >>> my screen would go blank every now and then and, when it came back, everything >>> would be remarkably slow (the latter is a separate bug). >>> >>> Signed-off-by: Andy Lutomirski <luto@amacapital.net> >> >> [...] >> >>> diff --git a/drivers/gpu/drm/radeon/radeon_ring.c b/drivers/gpu/drm/radeon/radeon_ring.c >>> index 1ef5eaa..fb7b3ea 100644 >>> --- a/drivers/gpu/drm/radeon/radeon_ring.c >>> +++ b/drivers/gpu/drm/radeon/radeon_ring.c >>> @@ -547,12 +547,12 @@ void radeon_ring_lockup_update(struct radeon_ring *ring) >>> * have CP rptr to a different value of jiffies wrap around which will force >>> * initialization of the lockup tracking informations. >>> * >>> - * A possible false positivie is if we get call after while and last_cp_rptr == >>> - * the current CP rptr, even if it's unlikely it might happen. To avoid this >>> - * if the elapsed time since last call is bigger than 2 second than we return >>> - * false and update the tracking information. Due to this the caller must call >>> - * radeon_ring_test_lockup several time in less than 2sec for lockup to be reported >>> - * the fencing code should be cautious about that. >>> + * A possible false positive is if we get called after a while and >>> + * last_cp_rptr == the current CP rptr, even if it's unlikely it might >>> + * happen. To avoid this if the elapsed time since the last call is bigger >>> + * than 2 second then we return false and update the tracking >>> + * information. Due to this the caller must call radeon_ring_test_lockup >>> + * more frequently than once every 2s when waiting. >> >> Is it guaranteed that radeon_ring_test_lockup will be called more often >> than every 2s when waiting? If not, this change might prevent a real >> lockup from being detected? > > Yes it will if you wait for a fence, because the fence timeout wait is > way smaller than 2sec so radeon_ring_is_lockup get call several time, > which call radeon_ring_force_activity and then > radeon_ring_test_lockup. > > This also means it very very very unlikely (see below for the likely > case) to have a wrap around that give last rptr same as current one. > > The likely case is when you have something like a long compute, then > nothing is lockup but you keep filling ring with > radeon_ring_force_activity but the cp is still stuck on the ib of the > compute stuff so rptr does not progress. > >> Either way, I wonder if there might not be a simpler solution to the >> problem, e.g. by updating last_activity when submitting commands to a >> previously empty ring. > > Maybe but i still don't think it should matter. > > Andy can you test (without your patch) and see if it helps with your issue : > http://people.freedesktop.org/~glisse/0001-drm-radeon-update-lockup-tracking-when-scheduling-in.patch Testing now. I'll report back in a couple of days. I don't think that long computes have anything to do with it. The bogus lockups happen when I look away from my computer for a while and then click something. I thing the graphics are usually completely idle when this happens. AFAIK I've never run an OpenCL or similar application on this system. --Andy
On Thu, Jun 13, 2013 at 2:22 PM, Andy Lutomirski <luto@amacapital.net> wrote: > On Wed, Jun 12, 2013 at 6:56 AM, Jerome Glisse <j.glisse@gmail.com> wrote: >> Andy can you test (without your patch) and see if it helps with your issue : >> http://people.freedesktop.org/~glisse/0001-drm-radeon-update-lockup-tracking-when-scheduling-in.patch > > Testing now. I'll report back in a couple of days. > 3.9.4 plus this patch has been completely stable for several days now. Tested-by: Andy Lutomirski <luto@amacapital.net> Can you send this to Linux and -stable? Thanks, Andy
diff --git a/drivers/gpu/drm/radeon/radeon.h b/drivers/gpu/drm/radeon/radeon.h index 8263af3..9de5778 100644 --- a/drivers/gpu/drm/radeon/radeon.h +++ b/drivers/gpu/drm/radeon/radeon.h @@ -652,6 +652,7 @@ struct radeon_ring { unsigned ring_free_dw; int count_dw; unsigned long last_activity; + unsigned long last_test_lockup; unsigned last_rptr; uint64_t gpu_addr; uint32_t align_mask; diff --git a/drivers/gpu/drm/radeon/radeon_ring.c b/drivers/gpu/drm/radeon/radeon_ring.c index 1ef5eaa..fb7b3ea 100644 --- a/drivers/gpu/drm/radeon/radeon_ring.c +++ b/drivers/gpu/drm/radeon/radeon_ring.c @@ -547,12 +547,12 @@ void radeon_ring_lockup_update(struct radeon_ring *ring) * have CP rptr to a different value of jiffies wrap around which will force * initialization of the lockup tracking informations. * - * A possible false positivie is if we get call after while and last_cp_rptr == - * the current CP rptr, even if it's unlikely it might happen. To avoid this - * if the elapsed time since last call is bigger than 2 second than we return - * false and update the tracking information. Due to this the caller must call - * radeon_ring_test_lockup several time in less than 2sec for lockup to be reported - * the fencing code should be cautious about that. + * A possible false positive is if we get called after a while and + * last_cp_rptr == the current CP rptr, even if it's unlikely it might + * happen. To avoid this if the elapsed time since the last call is bigger + * than 2 second then we return false and update the tracking + * information. Due to this the caller must call radeon_ring_test_lockup + * more frequently than once every 2s when waiting. * * Caller should write to the ring to force CP to do something so we don't get * false positive when CP is just gived nothing to do. @@ -560,10 +560,14 @@ void radeon_ring_lockup_update(struct radeon_ring *ring) **/ bool radeon_ring_test_lockup(struct radeon_device *rdev, struct radeon_ring *ring) { - unsigned long cjiffies, elapsed; + unsigned long cjiffies, elapsed, last_test; uint32_t rptr; cjiffies = jiffies; + + last_test = ring->last_test_lockup; + ring->last_test_lockup = cjiffies; + if (!time_after(cjiffies, ring->last_activity)) { /* likely a wrap around */ radeon_ring_lockup_update(ring); @@ -576,6 +580,11 @@ bool radeon_ring_test_lockup(struct radeon_device *rdev, struct radeon_ring *rin radeon_ring_lockup_update(ring); return false; } + if (cjiffies - last_test > 2 * HZ) { + /* Possible race -- see comment above */ + radeon_ring_lockup_update(ring); + return false; + } elapsed = jiffies_to_msecs(cjiffies - ring->last_activity); if (radeon_lockup_timeout && elapsed >= radeon_lockup_timeout) { dev_err(rdev->dev, "GPU lockup CP stall for more than %lumsec\n", elapsed);
If the device is idle for over ten seconds, then the next attempt to do anything can race with the lockup detector and cause a bogus lockup to be detected. Oddly, the situation is well-described in the lockup detector's comments and a fix is even described. This patch implements that fix (and corrects some typos in the description). My system has been stable for about a week running this code. Without this, my screen would go blank every now and then and, when it came back, everything would be remarkably slow (the latter is a separate bug). Signed-off-by: Andy Lutomirski <luto@amacapital.net> --- This may be -stable material. drivers/gpu/drm/radeon/radeon.h | 1 + drivers/gpu/drm/radeon/radeon_ring.c | 23 ++++++++++++++++------- 2 files changed, 17 insertions(+), 7 deletions(-)