Message ID | 20161221102331.31033-1-daniel.vetter@ffwll.ch (mailing list archive) |
---|---|
State | New, archived |
On Wed, Dec 21, 2016 at 11:23:30AM +0100, Daniel Vetter wrote:
> When writing the generic nonblocking commit code I assumed that
> through clever lifetime management I can ensure that the completion
> (stored in drm_crtc_commit) only gets freed after it is completed. And
> that worked.
>
> I also wanted to make nonblocking helpers resilient against driver
> bugs, by having timeouts everywhere. And that worked too.
>
> Unfortunately taking both things together results in oopses :( Well,
> at least sometimes: What seems to happen is that the drm event hangs
> around forever stuck in limbo land. The nonblocking helpers eventually
> time out, move on and release it. Now the bug I tested all this
> against is drivers that just entirely fail to deliver the vblank
> events like they should, and in those cases the event is simply
> leaked. But what seems to happen, at least sometimes, on i915 is that
> the event is set up correctly, but somehow the vblank fails to fire in
> time. Which means the event isn't leaked, it's still there waiting for
> a vblank to eventually fire. That tends to happen when re-enabling the
> pipe, and then the trap springs and the kernel oopses.
>
> The correct fix here is simply to refcount the crtc commit to make
> sure that the event sticks around even for drivers which only
> sometimes fail to deliver vblanks for some arbitrary reasons. Since
> crtc commits are already refcounted that's easy to do.

Or make the event a part of the atomic state?
-Chris
On 21-12-16 at 11:36, Chris Wilson wrote:
> On Wed, Dec 21, 2016 at 11:23:30AM +0100, Daniel Vetter wrote:
>> When writing the generic nonblocking commit code I assumed that
>> through clever lifetime management I can ensure that the completion
>> (stored in drm_crtc_commit) only gets freed after it is completed. And
>> that worked.
>>
>> I also wanted to make nonblocking helpers resilient against driver
>> bugs, by having timeouts everywhere. And that worked too.
>>
>> Unfortunately taking both things together results in oopses :( Well,
>> at least sometimes: What seems to happen is that the drm event hangs
>> around forever stuck in limbo land. The nonblocking helpers eventually
>> time out, move on and release it. Now the bug I tested all this
>> against is drivers that just entirely fail to deliver the vblank
>> events like they should, and in those cases the event is simply
>> leaked. But what seems to happen, at least sometimes, on i915 is that
>> the event is set up correctly, but somehow the vblank fails to fire in
>> time. Which means the event isn't leaked, it's still there waiting for
>> a vblank to eventually fire. That tends to happen when re-enabling the
>> pipe, and then the trap springs and the kernel oopses.
>>
>> The correct fix here is simply to refcount the crtc commit to make
>> sure that the event sticks around even for drivers which only
>> sometimes fail to deliver vblanks for some arbitrary reasons. Since
>> crtc commits are already refcounted that's easy to do.
> Or make the event a part of the atomic state?
> -Chris
>
afaict crtc commit is already taken to wait for completion, so this patch makes sense.

There's just a minor typo in the subject. :)
Not sure that release_commit should be a function pointer, regardless..

Reviewed-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
On Wed, Dec 21, 2016 at 10:36:41AM +0000, Chris Wilson wrote:
> On Wed, Dec 21, 2016 at 11:23:30AM +0100, Daniel Vetter wrote:
> > When writing the generic nonblocking commit code I assumed that
> > through clever lifetime management I can ensure that the completion
> > (stored in drm_crtc_commit) only gets freed after it is completed. And
> > that worked.
> >
> > I also wanted to make nonblocking helpers resilient against driver
> > bugs, by having timeouts everywhere. And that worked too.
> >
> > Unfortunately taking both things together results in oopses :( Well,
> > at least sometimes: What seems to happen is that the drm event hangs
> > around forever stuck in limbo land. The nonblocking helpers eventually
> > time out, move on and release it. Now the bug I tested all this
> > against is drivers that just entirely fail to deliver the vblank
> > events like they should, and in those cases the event is simply
> > leaked. But what seems to happen, at least sometimes, on i915 is that
> > the event is set up correctly, but somehow the vblank fails to fire in
> > time. Which means the event isn't leaked, it's still there waiting for
> > a vblank to eventually fire. That tends to happen when re-enabling the
> > pipe, and then the trap springs and the kernel oopses.
> >
> > The correct fix here is simply to refcount the crtc commit to make
> > sure that the event sticks around even for drivers which only
> > sometimes fail to deliver vblanks for some arbitrary reasons. Since
> > crtc commits are already refcounted that's easy to do.
>
> Or make the event a part of the atomic state?

I guess we could do that, but I wanted the most minimal thing for
backporting. And reference-counted atomic state is new, and the patch
would be a bit bigger.
-Daniel
On Wed, Dec 21, 2016 at 12:08:45PM +0100, Maarten Lankhorst wrote:
> On 21-12-16 at 11:36, Chris Wilson wrote:
> > On Wed, Dec 21, 2016 at 11:23:30AM +0100, Daniel Vetter wrote:
> >> When writing the generic nonblocking commit code I assumed that
> >> through clever lifetime management I can ensure that the completion
> >> (stored in drm_crtc_commit) only gets freed after it is completed. And
> >> that worked.
> >>
> >> I also wanted to make nonblocking helpers resilient against driver
> >> bugs, by having timeouts everywhere. And that worked too.
> >>
> >> Unfortunately taking both things together results in oopses :( Well,
> >> at least sometimes: What seems to happen is that the drm event hangs
> >> around forever stuck in limbo land. The nonblocking helpers eventually
> >> time out, move on and release it. Now the bug I tested all this
> >> against is drivers that just entirely fail to deliver the vblank
> >> events like they should, and in those cases the event is simply
> >> leaked. But what seems to happen, at least sometimes, on i915 is that
> >> the event is set up correctly, but somehow the vblank fails to fire in
> >> time. Which means the event isn't leaked, it's still there waiting for
> >> a vblank to eventually fire. That tends to happen when re-enabling the
> >> pipe, and then the trap springs and the kernel oopses.
> >>
> >> The correct fix here is simply to refcount the crtc commit to make
> >> sure that the event sticks around even for drivers which only
> >> sometimes fail to deliver vblanks for some arbitrary reasons. Since
> >> crtc commits are already refcounted that's easy to do.
> > Or make the event a part of the atomic state?
> > -Chris
> >
> afaict crtc commit is already taken to wait for completion, so this patch makes sense.
>
> There's just a minor typo in the subject. :)
> Not sure that release_commit should be a function pointer, regardless..
>
> Reviewed-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>

It didn't help the bug reporters against oopses (but the reporters are
supremely confusing, I have no idea what's really being tested, the
bugzilla is a mess), but I still think the patch is useful for more
robustness, I dropped the cc: stable and applied it to drm-misc.

Thanks for the review.
-Daniel
On Wed, 04 Jan 2017, Daniel Vetter <daniel@ffwll.ch> wrote:
> On Wed, Dec 21, 2016 at 12:08:45PM +0100, Maarten Lankhorst wrote:
>> On 21-12-16 at 11:36, Chris Wilson wrote:
>> > On Wed, Dec 21, 2016 at 11:23:30AM +0100, Daniel Vetter wrote:
>> >> When writing the generic nonblocking commit code I assumed that
>> >> through clever lifetime management I can ensure that the completion
>> >> (stored in drm_crtc_commit) only gets freed after it is completed. And
>> >> that worked.
>> >>
>> >> I also wanted to make nonblocking helpers resilient against driver
>> >> bugs, by having timeouts everywhere. And that worked too.
>> >>
>> >> Unfortunately taking both things together results in oopses :( Well,
>> >> at least sometimes: What seems to happen is that the drm event hangs
>> >> around forever stuck in limbo land. The nonblocking helpers eventually
>> >> time out, move on and release it. Now the bug I tested all this
>> >> against is drivers that just entirely fail to deliver the vblank
>> >> events like they should, and in those cases the event is simply
>> >> leaked. But what seems to happen, at least sometimes, on i915 is that
>> >> the event is set up correctly, but somehow the vblank fails to fire in
>> >> time. Which means the event isn't leaked, it's still there waiting for
>> >> a vblank to eventually fire. That tends to happen when re-enabling the
>> >> pipe, and then the trap springs and the kernel oopses.
>> >>
>> >> The correct fix here is simply to refcount the crtc commit to make
>> >> sure that the event sticks around even for drivers which only
>> >> sometimes fail to deliver vblanks for some arbitrary reasons. Since
>> >> crtc commits are already refcounted that's easy to do.
>> > Or make the event a part of the atomic state?
>> > -Chris
>> >
>> afaict crtc commit is already taken to wait for completion, so this patch makes sense.
>>
>> There's just a minor typo in the subject. :)
>> Not sure that release_commit should be a function pointer, regardless..
>>
>> Reviewed-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
>
> It didn't help the bug reporters against oopses (but the reporters are
> supremely confusing, I have no idea what's really being tested, the
> bugzilla is a mess), but I still think the patch is useful for more
> robustness, I dropped the cc: stable and applied it to drm-misc.

Agreed on the bug [1] being a mess. However, the bug has a reliable
bisect result, the revert was posted by some of the reporters on the
lists and in the bug, and now something that will not help anyone in
v4.9 or v4.10 was pushed. :(

BR,
Jani.

[1] https://bugs.freedesktop.org/show_bug.cgi?id=96781
On Thu, Feb 09, 2017 at 04:39:29PM +0200, Jani Nikula wrote:
> On Wed, 04 Jan 2017, Daniel Vetter <daniel@ffwll.ch> wrote:
> > On Wed, Dec 21, 2016 at 12:08:45PM +0100, Maarten Lankhorst wrote:
> >> On 21-12-16 at 11:36, Chris Wilson wrote:
> >> > On Wed, Dec 21, 2016 at 11:23:30AM +0100, Daniel Vetter wrote:
> >> >> When writing the generic nonblocking commit code I assumed that
> >> >> through clever lifetime management I can ensure that the completion
> >> >> (stored in drm_crtc_commit) only gets freed after it is completed. And
> >> >> that worked.
> >> >>
> >> >> I also wanted to make nonblocking helpers resilient against driver
> >> >> bugs, by having timeouts everywhere. And that worked too.
> >> >>
> >> >> Unfortunately taking both things together results in oopses :( Well,
> >> >> at least sometimes: What seems to happen is that the drm event hangs
> >> >> around forever stuck in limbo land. The nonblocking helpers eventually
> >> >> time out, move on and release it. Now the bug I tested all this
> >> >> against is drivers that just entirely fail to deliver the vblank
> >> >> events like they should, and in those cases the event is simply
> >> >> leaked. But what seems to happen, at least sometimes, on i915 is that
> >> >> the event is set up correctly, but somehow the vblank fails to fire in
> >> >> time. Which means the event isn't leaked, it's still there waiting for
> >> >> a vblank to eventually fire. That tends to happen when re-enabling the
> >> >> pipe, and then the trap springs and the kernel oopses.
> >> >>
> >> >> The correct fix here is simply to refcount the crtc commit to make
> >> >> sure that the event sticks around even for drivers which only
> >> >> sometimes fail to deliver vblanks for some arbitrary reasons. Since
> >> >> crtc commits are already refcounted that's easy to do.
> >> > Or make the event a part of the atomic state?
> >> > -Chris
> >> >
> >> afaict crtc commit is already taken to wait for completion, so this patch makes sense.
> >>
> >> There's just a minor typo in the subject. :)
> >> Not sure that release_commit should be a function pointer, regardless..
> >>
> >> Reviewed-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
> >
> > It didn't help the bug reporters against oopses (but the reporters are
> > supremely confusing, I have no idea what's really being tested, the
> > bugzilla is a mess), but I still think the patch is useful for more
> > robustness, I dropped the cc: stable and applied it to drm-misc.
>
> Agreed on the bug [1] being a mess. However, the bug has a reliable
> bisect result, the revert was posted by some of the reporters on the
> lists and in the bug, and now something that will not help anyone in
> v4.9 or v4.10 was pushed. :(

Latest report just says that the revert isn't helping either. I suspect
the report is a gigantic conflagration of everything ever that kills
various reporters' boxes.

I still believe that the patch here fixes the original bug, but there
might be a lot more hiding. It's at least seen quite a pile of testing,
so I think it's sound, and we could cherry-pick it to dinf with cc:
stable for 4.9+. Worst case it's not going to help for the other
problems.
-Daniel
Daniel Vetter wrote:
> Latest report just says that the revert isn't helping either. I suspect
> the report is a gigantic conflagration of everything ever that kills
> various reporters' boxes.
>
> I still believe that the patch here fixes the original bug, but there
> might be a lot more hiding. It's at least seen quite a pile of testing,
> so I think it's sound, and we could cherry-pick it to dinf with cc:
> stable for 4.9+. Worst case it's not going to help for the other
> problems.

No, that's not what the latest report says. It says, "running for 2
weeks ... This is certainly way, way better than the current stock
experience, which results in my T460s entirely locking up daily." and
"Less than a day after I made that comment I got a hard lockup".

So reverting the buggy helper nonblock tracking commit took this
reporter from locking up daily to locking up once in two weeks. For
everyone else, reverting the buggy commit fixes all bugs. Also note that
this most recent lockup appears to be a different bug ("GPU HANG:
ecode").

So we have a commit that is causing hard lockups and flip_done timeouts
for multiple users. Reverting this commit fixes the problem. But we did
not push the revert up for 4.9, and it looks like we're not going to
push it up for 4.10 either.
diff --git a/drivers/gpu/drm/drm_atomic_helper.c b/drivers/gpu/drm/drm_atomic_helper.c
index 799c1564a4f8..b4dfd1e1a4f0 100644
--- a/drivers/gpu/drm/drm_atomic_helper.c
+++ b/drivers/gpu/drm/drm_atomic_helper.c
@@ -1355,6 +1355,15 @@ static int stall_checks(struct drm_crtc *crtc, bool nonblock)
 	return ret < 0 ? ret : 0;
 }
 
+void release_crtc_commit(struct completion *completion)
+{
+	struct drm_crtc_commit *commit = container_of(completion,
+						      typeof(*commit),
+						      flip_done);
+
+	drm_crtc_commit_put(commit);
+}
+
 /**
  * drm_atomic_helper_setup_commit - setup possibly nonblocking commit
  * @state: new modeset state to be committed
@@ -1447,6 +1456,8 @@ int drm_atomic_helper_setup_commit(struct drm_atomic_state *state,
 		}
 
 		crtc_state->event->base.completion = &commit->flip_done;
+		crtc_state->event->base.completion_release = release_crtc_commit;
+		drm_crtc_commit_get(commit);
 	}
 
 	return 0;
diff --git a/drivers/gpu/drm/drm_fops.c b/drivers/gpu/drm/drm_fops.c
index 48e106557c92..e22645375e60 100644
--- a/drivers/gpu/drm/drm_fops.c
+++ b/drivers/gpu/drm/drm_fops.c
@@ -689,8 +689,8 @@ void drm_send_event_locked(struct drm_device *dev, struct drm_pending_event *e)
 	assert_spin_locked(&dev->event_lock);
 
 	if (e->completion) {
-		/* ->completion might disappear as soon as it signalled. */
 		complete_all(e->completion);
+		e->completion_release(e->completion);
 		e->completion = NULL;
 	}
 
diff --git a/include/drm/drmP.h b/include/drm/drmP.h
index a9cfd33c7b1a..e821a8f142d9 100644
--- a/include/drm/drmP.h
+++ b/include/drm/drmP.h
@@ -360,6 +360,7 @@ struct drm_ioctl_desc {
 /* Event queued up for userspace to read */
 struct drm_pending_event {
 	struct completion *completion;
+	void (*completion_release)(struct completion *completion);
 	struct drm_event *event;
 	struct dma_fence *fence;
 	struct list_head link;
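For readers unfamiliar with the idiom, release_crtc_commit() in the diff gets from the embedded flip_done completion back to the drm_crtc_commit that contains it via container_of(). The following is only a userspace sketch of that pointer arithmetic. The struct layouts and the commit_from_completion() name are hypothetical stand-ins, and the macro is a simplified version of the kernel's (which also does type checking).

```c
#include <assert.h>
#include <stddef.h>

/* Simplified userspace stand-in for the kernel's container_of(). */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical, cut-down structs; not the real kernel definitions. */
struct completion {
	int done;
};

struct crtc_commit {
	int refcount;
	struct completion flip_done;	/* embedded, not a pointer */
};

/*
 * Same shape as release_crtc_commit() in the patch: given only a pointer
 * to the embedded member, recover the enclosing object.
 */
static struct crtc_commit *commit_from_completion(struct completion *c)
{
	return container_of(c, struct crtc_commit, flip_done);
}
```

This only works because the completion is embedded in the commit structure. That is also why the event has to keep the whole commit alive, not just the completion.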
When writing the generic nonblocking commit code I assumed that
through clever lifetime management I can ensure that the completion
(stored in drm_crtc_commit) only gets freed after it is completed. And
that worked.

I also wanted to make nonblocking helpers resilient against driver
bugs, by having timeouts everywhere. And that worked too.

Unfortunately taking both things together results in oopses :( Well,
at least sometimes: What seems to happen is that the drm event hangs
around forever stuck in limbo land. The nonblocking helpers eventually
time out, move on and release it. Now the bug I tested all this
against is drivers that just entirely fail to deliver the vblank
events like they should, and in those cases the event is simply
leaked. But what seems to happen, at least sometimes, on i915 is that
the event is set up correctly, but somehow the vblank fails to fire in
time. Which means the event isn't leaked, it's still there waiting for
a vblank to eventually fire. That tends to happen when re-enabling the
pipe, and then the trap springs and the kernel oopses.

The correct fix here is simply to refcount the crtc commit to make
sure that the event sticks around even for drivers which only
sometimes fail to deliver vblanks for some arbitrary reasons. Since
crtc commits are already refcounted that's easy to do.

References: https://bugs.freedesktop.org/show_bug.cgi?id=96781
Cc: Jim Rees <rees@umich.edu>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: stable@vger.kernel.org
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
---
 drivers/gpu/drm/drm_atomic_helper.c | 11 +++++++++++
 drivers/gpu/drm/drm_fops.c          |  2 +-
 include/drm/drmP.h                  |  1 +
 3 files changed, 13 insertions(+), 1 deletion(-)
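To make the lifetime argument concrete, here is a small single-threaded userspace model of the scenario the commit message describes: the helper times out and drops its reference, and the vblank event arrives late. All names (struct commit, event_attach(), event_send(), live_commits) are hypothetical and chosen for illustration. The model uses a plain int where the kernel uses a kref, and it lets the event hold the commit pointer directly instead of going through the completion pointer and a release callback as the patch does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical userspace model; not the kernel API. */
struct commit {
	int refcount;	/* plain int here; the kernel uses struct kref */
	bool flip_done;	/* stands in for the flip_done completion */
};

static int live_commits;	/* allocation counter so the model can be checked */

static struct commit *commit_alloc(void)
{
	struct commit *c = calloc(1, sizeof(*c));

	if (!c)
		abort();
	c->refcount = 1;	/* reference owned by the commit helper */
	live_commits++;
	return c;
}

static void commit_get(struct commit *c)
{
	c->refcount++;
}

static void commit_put(struct commit *c)
{
	if (--c->refcount == 0) {
		live_commits--;
		free(c);
	}
}

struct pending_event {
	struct commit *commit;	/* NULL once the event has been sent */
};

/* Setup: the event takes its own reference on the commit. */
static void event_attach(struct pending_event *e, struct commit *c)
{
	commit_get(c);
	e->commit = c;
}

/* Late vblank: signal completion first, then drop the event's reference. */
static void event_send(struct pending_event *e)
{
	if (e->commit) {
		e->commit->flip_done = true;
		commit_put(e->commit);
		e->commit = NULL;
	}
}
```

In this model, calling commit_put() for the helper's reference before event_send() leaves the commit allocated, because the event still holds a reference. Without the commit_get() in event_attach(), the same sequence would write flip_done into freed memory, which is the use-after-free the commit message attributes to the timeout path.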