
drm/i915: Keep ring->active_list and ring->requests_list consistent

Message ID 1426702762-6490-1-git-send-email-chris@chris-wilson.co.uk (mailing list archive)
State New, archived

Commit Message

Chris Wilson March 18, 2015, 6:19 p.m. UTC
If we retire requests last, we may use a later seqno and so clear
the requests lists without clearing the active list, leading to
confusion. Hence we should retire requests first for consistency with
the early return. The order used to be important as the lifecycle for
the object on the active list was determined by request->seqno. However,
the requests themselves are now reference counted removing the
constraint from the order of retirement.

Fixes regression from

commit 1b5a433a4dd967b125131da42b89b5cc0d5b1f57
Author: John Harrison <John.C.Harrison@Intel.com>
Date:   Mon Nov 24 18:49:42 2014 +0000

    drm/i915: Convert 'i915_seqno_passed' calls into 'i915_gem_request_completed'

and a

	WARNING: CPU: 0 PID: 1383 at drivers/gpu/drm/i915/i915_gem_evict.c:279 i915_gem_evict_vm+0x10c/0x140()
	WARN_ON(!list_empty(&vm->active_list))

Identified by updating WATCH_LISTS:

	[drm:i915_verify_lists] *ERROR* blitter ring: active list not empty, but no requests
	WARNING: CPU: 0 PID: 681 at drivers/gpu/drm/i915/i915_gem.c:2751 i915_gem_retire_requests_ring+0x149/0x230()
	WARN_ON(i915_verify_lists(ring->dev))

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: John Harrison <John.C.Harrison@Intel.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
---
 drivers/gpu/drm/i915/i915_gem.c | 38 +++++++++++++++++++++-----------------
 1 file changed, 21 insertions(+), 17 deletions(-)
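
To make the race in the commit message concrete, here is a toy userspace model (an editorial sketch, not driver code; the fake_request/fake_object types and the hw_seqno variable are invented purely for illustration). Each completion check re-reads the live seqno, so if the seqno advances between the active-list walk and the request-list walk, the request list is emptied while the object is left on the active list:

#include <stdbool.h>
#include <stdio.h>

static unsigned int hw_seqno;	/* stand-in for the seqno the GPU reports */

struct fake_request { unsigned int seqno; bool retired; };
struct fake_object  { struct fake_request *last_read_req; bool active; };

static bool request_completed(const struct fake_request *req)
{
	return hw_seqno >= req->seqno;	/* re-reads the live seqno each time */
}

int main(void)
{
	struct fake_request req = { .seqno = 2, .retired = false };
	struct fake_object  obj = { .last_read_req = &req, .active = true };

	hw_seqno = 1;			/* request not yet complete */

	/* old order: walk the "active list" first -> obj stays active */
	if (request_completed(obj.last_read_req))
		obj.active = false;

	hw_seqno = 2;			/* the GPU finishes in between the two loops */

	/* then walk the "request list" -> the request is retired */
	if (request_completed(&req))
		req.retired = true;

	/* prints "request retired: 1, object still active: 1" */
	printf("request retired: %d, object still active: %d\n",
	       req.retired, obj.active);
	return 0;
}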

Comments

Shuang He March 19, 2015, 11:18 a.m. UTC | #1
Tested-By: PRC QA PRTS (Patch Regression Test System Contact: shuang.he@intel.com)
Task id: 5996
-------------------------------------Summary-------------------------------------
Platform          Delta          drm-intel-nightly          Series Applied
PNV                                  272/272              272/272
ILK                                  301/301              301/301
SNB                                  303/303              303/303
IVB                                  342/342              342/342
BYT                                  287/287              287/287
HSW                                  362/362              362/362
BDW                                  308/308              308/308
-------------------------------------Detailed-------------------------------------
Platform  Test                                drm-intel-nightly          Series Applied
Note: You need to pay more attention to lines starting with '*'
Daniel Vetter March 19, 2015, 5:37 p.m. UTC | #2
On Wed, Mar 18, 2015 at 06:19:22PM +0000, Chris Wilson wrote:
> If we retire requests last, we may use a later seqno and so clear
> the requests lists without clearing the active list, leading to
> confusion. Hence we should retire requests first for consistency with
> the early return. The order used to be important as the lifecycle for
> the object on the active list was determined by request->seqno. However,
> the requests themselves are now reference counted removing the
> constraint from the order of retirement.
> 
> Fixes regression from
> 
> commit 1b5a433a4dd967b125131da42b89b5cc0d5b1f57
> Author: John Harrison <John.C.Harrison@Intel.com>
> Date:   Mon Nov 24 18:49:42 2014 +0000
> 
>     drm/i915: Convert 'i915_seqno_passed' calls into 'i915_gem_request_completed
> '
> 
> and a
> 
> 	WARNING: CPU: 0 PID: 1383 at drivers/gpu/drm/i915/i915_gem_evict.c:279 i915_gem_evict_vm+0x10c/0x140()
> 	WARN_ON(!list_empty(&vm->active_list))

How does this come about - we call gpu_idle before this seems to blow up,
so all requests should be completed? And I don't think we can blame this
on racy seqno signalling, since gpu_idle does all the waiting already ...

> Identified by updating WATCH_LISTS:
> 
> 	[drm:i915_verify_lists] *ERROR* blitter ring: active list not empty, but no requests
> 	WARNING: CPU: 0 PID: 681 at drivers/gpu/drm/i915/i915_gem.c:2751 i915_gem_retire_requests_ring+0x149/0x230()
> 	WARN_ON(i915_verify_lists(ring->dev))
> 
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: John Harrison <John.C.Harrison@Intel.com>
> Cc: Daniel Vetter <daniel.vetter@ffwll.ch>

Since we've just discussed this on irc: Doesn't this now enshrine that
every bo needs to hold a full request?
-Daniel
Chris Wilson March 19, 2015, 10:17 p.m. UTC | #3
On Thu, Mar 19, 2015 at 06:37:28PM +0100, Daniel Vetter wrote:
> On Wed, Mar 18, 2015 at 06:19:22PM +0000, Chris Wilson wrote:
> > 	WARNING: CPU: 0 PID: 1383 at drivers/gpu/drm/i915/i915_gem_evict.c:279 i915_gem_evict_vm+0x10c/0x140()
> > 	WARN_ON(!list_empty(&vm->active_list))
> 
> How does this come about - we call gpu_idle before this seems to blow up,
> so all requests should be completed?

Honestly, I couldn't figure it out either. I had an epiphany when I saw
that we could now have an empty request list but a non-empty active list,
added a test to detect when that happens, and shouted eureka when the
WARN fired. I could trigger the WARN in evict_vm pretty reliably, but
not since this patch. It could just be masking another bug.

> And I don't think we can blame this
> on racy seqno signalling, since gpu_idle does all the waiting already ...
> 
> > Identified by updating WATCH_LISTS:
> > 
> > 	[drm:i915_verify_lists] *ERROR* blitter ring: active list not empty, but no requests
> > 	WARNING: CPU: 0 PID: 681 at drivers/gpu/drm/i915/i915_gem.c:2751 i915_gem_retire_requests_ring+0x149/0x230()
> > 	WARN_ON(i915_verify_lists(ring->dev))
> > 
> > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> > Cc: John Harrison <John.C.Harrison@Intel.com>
> > Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
> 
> Since we've just discussed this on irc: Doesn't this now enshrine that
> every bo needs to hold a full request?

I'm not following. Each bo holds a reference to its requests, so we know we
can iterate the ring->request_list and the ring->active_list
independently. There is a challenge in doing the execbuf with as few
krefs as possible, but there is also the question of whether this
particular function should go back to the previous behaviour of batching
the completion evaluation for all requests such that they are evaluated
consistently. One way without killing the abstraction entirely would be
to evaluate the i915_request_complete() only for the request_list and
then use the cached completion value for the active_list.
-Chris
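
A rough sketch of that cached-completion alternative (an assumption about what it might look like, not an actual patch; sk_request and the other names are invented). The live seqno is consulted only while walking the request list, and the active-list walk reuses the cached verdict, so the two lists cannot disagree:

#include <stdbool.h>
#include <stdio.h>

struct sk_request { unsigned int seqno; bool completed; };

int main(void)
{
	unsigned int hw_seqno = 1;
	struct sk_request req = { .seqno = 2, .completed = false };
	bool obj_active = true;		/* one "active list" entry using req */

	/* request-list walk: the only place the live seqno is read */
	req.completed = hw_seqno >= req.seqno;

	hw_seqno = 2;			/* a later seqno no longer matters */

	/* active-list walk: trust the cached verdict, never re-read hw_seqno */
	if (req.completed)
		obj_active = false;

	/* prints "completed: 0, active: 1" - both walks agree */
	printf("completed: %d, active: %d\n", req.completed, obj_active);
	return 0;
}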
Daniel Vetter March 20, 2015, 10:06 a.m. UTC | #4
On Thu, Mar 19, 2015 at 10:17:42PM +0000, Chris Wilson wrote:
> On Thu, Mar 19, 2015 at 06:37:28PM +0100, Daniel Vetter wrote:
> > On Wed, Mar 18, 2015 at 06:19:22PM +0000, Chris Wilson wrote:
> > > 	WARNING: CPU: 0 PID: 1383 at drivers/gpu/drm/i915/i915_gem_evict.c:279 i915_gem_evict_vm+0x10c/0x140()
> > > 	WARN_ON(!list_empty(&vm->active_list))
> > 
> > How does this come about - we call gpu_idle before this seems to blow up,
> > so all requests should be completed?
> 
> Honestly, I couldn't figure it out either. I had an epiphany when I saw
> that we could now have an empty request list but non-empty active list
> added a test to detect when that happens and shouted eureka when the
> WARN fired. I could trigger the WARN in evict_vm pretty reliably, but
> not since this patch. It could just be masking another bug.

Can you perhaps double-check the theory by putting a
WARN_ON(list_empty(active_list) != list_empty(request_list)) into
gpu_idle? Ofc with this patch reverted so that the bug surfaces again.


Really strange indeed.

> > And I don't think we can blame this
> > on racy seqno signalling, since gpu_idle does all the waiting already ...
> > 
> > > Identified by updating WATCH_LISTS:
> > > 
> > > 	[drm:i915_verify_lists] *ERROR* blitter ring: active list not empty, but no requests
> > > 	WARNING: CPU: 0 PID: 681 at drivers/gpu/drm/i915/i915_gem.c:2751 i915_gem_retire_requests_ring+0x149/0x230()
> > > 	WARN_ON(i915_verify_lists(ring->dev))
> > > 
> > > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> > > Cc: John Harrison <John.C.Harrison@Intel.com>
> > > Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
> > 
> > Since we've just discussed this on irc: Doesn't this now enshrine that
> > every bo needs to hold a full request?
> 
> I'm not following. The bo hold a reference to requests, so we know we
> can iterate the ring->request_list and the ring->active_list
> independently. There is a challenge in doing the execbuf with as few
> kref as possible, but there is also the question of whether this
> particular function should go back to the previous behaviour of batching
> the completion evaluation for all requests such that they are evaluated
> consistently. One way without killing the abstraction entirely would be
> to evaluate the i915_request_complete() only for the request_list and
> then use the cached completion value for the active_list.

Yeah I meant the kref batching the old scheme would have allowed. I guess
better to figure this one out first completely before we dig into
micro-optimizations again.
-Daniel
Chris Wilson March 20, 2015, 1:02 p.m. UTC | #5
On Fri, Mar 20, 2015 at 11:06:57AM +0100, Daniel Vetter wrote:
> On Thu, Mar 19, 2015 at 10:17:42PM +0000, Chris Wilson wrote:
> > On Thu, Mar 19, 2015 at 06:37:28PM +0100, Daniel Vetter wrote:
> > > On Wed, Mar 18, 2015 at 06:19:22PM +0000, Chris Wilson wrote:
> > > > 	WARNING: CPU: 0 PID: 1383 at drivers/gpu/drm/i915/i915_gem_evict.c:279 i915_gem_evict_vm+0x10c/0x140()
> > > > 	WARN_ON(!list_empty(&vm->active_list))
> > > 
> > > How does this come about - we call gpu_idle before this seems to blow up,
> > > so all requests should be completed?
> > 
> > Honestly, I couldn't figure it out either. I had an epiphany when I saw
> > that we could now have an empty request list but non-empty active list
> > added a test to detect when that happens and shouted eureka when the
> > WARN fired. I could trigger the WARN in evict_vm pretty reliably, but
> > not since this patch. It could just be masking another bug.
> 
> Can you perhaps double-check the theory by putting a
> WARN_ON(list_empty(active_list) != list_empyt(request_list)) into
> gpu_idle? Ofc with this patch reverted so that the bug surfaces again.

[ 5215.567573] [drm:i915_verify_lists] *ERROR* render ring: active list not empty, but no requests
[ 5215.567586] ------------[ cut here ]------------
[ 5215.567598] WARNING: CPU: 0 PID: 1304 at drivers/gpu/drm/i915/i915_gem.c:3166 i915_gpu_idle+0x88/0x90()
[ 5215.567602] WARN_ON(i915_verify_lists(dev))
[ 5215.567606] Modules linked in: ctr ccm arc4 ath9k ath9k_common ath9k_hw bnep ath mac80211 rfcomm snd_hda_codec_conexant snd_hda_codec_generic snd_hda_intel snd_hda_controller snd_hda_codec uvcvideo snd_hwdep snd_pcm gpio_ich videobuf2_vmalloc dell_wmi cfg80211 videobuf2_memops sparse_keymap videobuf2_core dell_laptop snd_seq_midi v4l2_common dcdbas snd_seq_midi_event btusb videodev i8k snd_rawmidi snd_seq hid_multitouch coretemp bluetooth microcode snd_seq_device joydev snd_timer serio_raw snd shpchp soundcore wmi lpc_ich usbhid hid psmouse ahci libahci
[ 5215.567708] CPU: 0 PID: 1304 Comm: Xorg Tainted: G        W  OE   4.0.0-rc4+ #108
[ 5215.567713] Hardware name: Dell Inc. Inspiron 1090/Inspiron 1090, BIOS A06 08/23/2011
[ 5215.567718]  00000000 00000000 f46e1b98 c16b3e19 f46e1bd8 f46e1bc8 c1047f17 c1937e78
[ 5215.567733]  f46e1bf4 00000518 c1937cec 00000c5e c14441e8 c14441e8 e733bdc8 00000000
[ 5215.567747]  f6346c00 f46e1be0 c1047f83 00000009 f46e1bd8 c1937e78 f46e1bf4 f46e1c00
[ 5215.567762] Call Trace:
[ 5215.567776]  [<c16b3e19>] dump_stack+0x41/0x52
[ 5215.567788]  [<c1047f17>] warn_slowpath_common+0x87/0xc0
[ 5215.567797]  [<c14441e8>] ? i915_gpu_idle+0x88/0x90
[ 5215.567805]  [<c14441e8>] ? i915_gpu_idle+0x88/0x90
[ 5215.567815]  [<c1047f83>] warn_slowpath_fmt+0x33/0x40
[ 5215.567823]  [<c14441e8>] i915_gpu_idle+0x88/0x90
[ 5215.567833]  [<c1439949>] i915_gem_evict_something+0x269/0x300
[ 5215.567843]  [<c144754f>] i915_gem_object_do_pin+0x6ef/0xb20
[ 5215.567854]  [<c14479c5>] i915_gem_object_pin+0x45/0x50
[ 5215.567864]  [<c1439f08>] i915_gem_execbuffer_reserve_vma.isra.13+0x78/0x180
[ 5215.567874]  [<c143a2e5>] i915_gem_execbuffer_reserve+0x2d5/0x320
[ 5215.567884]  [<c11594cd>] ? __kmalloc+0x14d/0x190
[ 5215.567894]  [<c143b6d9>] i915_gem_do_execbuffer.isra.17+0x5c9/0xdd0
[ 5215.567906]  [<c112efdb>] ? vm_mmap_pgoff+0x7b/0xa0
[ 5215.567915]  [<c11594cd>] ? __kmalloc+0x14d/0x190
[ 5215.567925]  [<c143cfeb>] i915_gem_execbuffer2+0x8b/0x2c0
[ 5215.567934]  [<c143cf60>] ? i915_gem_execbuffer+0x4e0/0x4e0
[ 5215.567944]  [<c1401d67>] drm_ioctl+0x1b7/0x510
[ 5215.567954]  [<c1120a9a>] ? balance_dirty_pages_ratelimited+0x1a/0x6a0
[ 5215.567963]  [<c143cf60>] ? i915_gem_execbuffer+0x4e0/0x4e0
[ 5215.567975]  [<c113cef9>] ? handle_mm_fault+0x329/0x1250
[ 5215.567984]  [<c1401bb0>] ? drm_getmap+0xb0/0xb0
[ 5215.567994]  [<c117d9ca>] do_vfs_ioctl+0x30a/0x530
[ 5215.568005]  [<c10a9e92>] ? ktime_get_ts64+0x52/0x1a0
[ 5215.568095]  [<c1185f62>] ? __fget_light+0x22/0x60
[ 5215.568136]  [<c117dc50>] SyS_ioctl+0x60/0x90
[ 5215.568175]  [<c16b9bc8>] sysenter_do_call+0x12/0x12
[ 5215.568198] ---[ end trace ab3f7e4953cb9eb6 ]---
[ 5215.568272] ------------[ cut here ]------------
[ 5215.568288] WARNING: CPU: 0 PID: 1304 at drivers/gpu/drm/i915/i915_gem_evict.c:283 i915_gem_evict_vm+0x10c/0x140()
[ 5215.568292] WARN_ON(!list_empty(&vm->active_list))
[ 5215.568296] Modules linked in: ctr ccm arc4 ath9k ath9k_common ath9k_hw bnep ath mac80211 rfcomm snd_hda_codec_conexant snd_hda_codec_generic snd_hda_intel snd_hda_controller snd_hda_codec uvcvideo snd_hwdep snd_pcm gpio_ich videobuf2_vmalloc dell_wmi cfg80211 videobuf2_memops sparse_keymap videobuf2_core dell_laptop snd_seq_midi v4l2_common dcdbas snd_seq_midi_event btusb videodev i8k snd_rawmidi snd_seq hid_multitouch coretemp bluetooth microcode snd_seq_device joydev snd_timer serio_raw snd shpchp soundcore wmi lpc_ich usbhid hid psmouse ahci libahci
[ 5215.568383] CPU: 0 PID: 1304 Comm: Xorg Tainted: G        W  OE   4.0.0-rc4+ #108
[ 5215.568388] Hardware name: Dell Inc. Inspiron 1090/Inspiron 1090, BIOS A06 08/23/2011
[ 5215.568393]  00000000 00000000 f46e1cc0 c16b3e19 f46e1d00 f46e1cf0 c1047f17 c193712c
[ 5215.568407]  f46e1d1c 00000518 c19370d0 0000011b c1439c6c c1439c6c f3b225b0 e733c3ec
[ 5215.568421]  00000001 f46e1d08 c1047f83 00000009 f46e1d00 c193712c f46e1d1c f46e1d28
[ 5215.568435] Call Trace:
[ 5215.568445]  [<c16b3e19>] dump_stack+0x41/0x52
[ 5215.568455]  [<c1047f17>] warn_slowpath_common+0x87/0xc0
[ 5215.568465]  [<c1439c6c>] ? i915_gem_evict_vm+0x10c/0x140
[ 5215.568474]  [<c1439c6c>] ? i915_gem_evict_vm+0x10c/0x140
[ 5215.568483]  [<c1047f83>] warn_slowpath_fmt+0x33/0x40
[ 5215.568492]  [<c1439c6c>] i915_gem_evict_vm+0x10c/0x140
[ 5215.568502]  [<c143a236>] i915_gem_execbuffer_reserve+0x226/0x320
[ 5215.568511]  [<c11594cd>] ? __kmalloc+0x14d/0x190
[ 5215.568521]  [<c143b6d9>] i915_gem_do_execbuffer.isra.17+0x5c9/0xdd0
[ 5215.568532]  [<c112efdb>] ? vm_mmap_pgoff+0x7b/0xa0
[ 5215.568541]  [<c11594cd>] ? __kmalloc+0x14d/0x190
[ 5215.568550]  [<c143cfeb>] i915_gem_execbuffer2+0x8b/0x2c0
[ 5215.568560]  [<c143cf60>] ? i915_gem_execbuffer+0x4e0/0x4e0
[ 5215.568568]  [<c1401d67>] drm_ioctl+0x1b7/0x510
[ 5215.568577]  [<c1120a9a>] ? balance_dirty_pages_ratelimited+0x1a/0x6a0
[ 5215.568587]  [<c143cf60>] ? i915_gem_execbuffer+0x4e0/0x4e0
[ 5215.568599]  [<c113cef9>] ? handle_mm_fault+0x329/0x1250
[ 5215.568607]  [<c1401bb0>] ? drm_getmap+0xb0/0xb0
[ 5215.568616]  [<c117d9ca>] do_vfs_ioctl+0x30a/0x530
[ 5215.568626]  [<c10a9e92>] ? ktime_get_ts64+0x52/0x1a0
[ 5215.568635]  [<c1185f62>] ? __fget_light+0x22/0x60
[ 5215.568644]  [<c117dc50>] SyS_ioctl+0x60/0x90
[ 5215.568653]  [<c16b9bc8>] sysenter_do_call+0x12/0x12
[ 5215.568659] ---[ end trace ab3f7e4953cb9eb7 ]---
Chris Wilson March 20, 2015, 1:39 p.m. UTC | #6
On Fri, Mar 20, 2015 at 01:02:10PM +0000, Chris Wilson wrote:
> On Fri, Mar 20, 2015 at 11:06:57AM +0100, Daniel Vetter wrote:
> > On Thu, Mar 19, 2015 at 10:17:42PM +0000, Chris Wilson wrote:
> > > On Thu, Mar 19, 2015 at 06:37:28PM +0100, Daniel Vetter wrote:
> > > > On Wed, Mar 18, 2015 at 06:19:22PM +0000, Chris Wilson wrote:
> > > > > 	WARNING: CPU: 0 PID: 1383 at drivers/gpu/drm/i915/i915_gem_evict.c:279 i915_gem_evict_vm+0x10c/0x140()
> > > > > 	WARN_ON(!list_empty(&vm->active_list))
> > > > 
> > > > How does this come about - we call gpu_idle before this seems to blow up,
> > > > so all requests should be completed?
> > > 
> > > Honestly, I couldn't figure it out either. I had an epiphany when I saw
> > > that we could now have an empty request list but non-empty active list
> > > added a test to detect when that happens and shouted eureka when the
> > > WARN fired. I could trigger the WARN in evict_vm pretty reliably, but
> > > not since this patch. It could just be masking another bug.
> > 
> > Can you perhaps double-check the theory by putting a
> > WARN_ON(list_empty(active_list) != list_empyt(request_list)) into
> > gpu_idle? Ofc with this patch reverted so that the bug surfaces again.
> 
> [dmesg trace from the previous mail snipped]

Ah, so what it boils down to is that i915_gpu_idle() is a no-op here as
list_empty(ring->request_list) is true [intel_ring_idle:2176].

Missing link discovered, I think the bug fixed by the patch is indeed
the same one that triggered the first WARN.
-Chris
Daniel Vetter March 20, 2015, 2:32 p.m. UTC | #7
On Fri, Mar 20, 2015 at 01:39:51PM +0000, Chris Wilson wrote:
> On Fri, Mar 20, 2015 at 01:02:10PM +0000, Chris Wilson wrote:
> > On Fri, Mar 20, 2015 at 11:06:57AM +0100, Daniel Vetter wrote:
> > > On Thu, Mar 19, 2015 at 10:17:42PM +0000, Chris Wilson wrote:
> > > > On Thu, Mar 19, 2015 at 06:37:28PM +0100, Daniel Vetter wrote:
> > > > > On Wed, Mar 18, 2015 at 06:19:22PM +0000, Chris Wilson wrote:
> > > > > > 	WARNING: CPU: 0 PID: 1383 at drivers/gpu/drm/i915/i915_gem_evict.c:279 i915_gem_evict_vm+0x10c/0x140()
> > > > > > 	WARN_ON(!list_empty(&vm->active_list))
> > > > > 
> > > > > How does this come about - we call gpu_idle before this seems to blow up,
> > > > > so all requests should be completed?
> > > > 
> > > > Honestly, I couldn't figure it out either. I had an epiphany when I saw
> > > > that we could now have an empty request list but non-empty active list
> > > > added a test to detect when that happens and shouted eureka when the
> > > > WARN fired. I could trigger the WARN in evict_vm pretty reliably, but
> > > > not since this patch. It could just be masking another bug.
> > > 
> > > Can you perhaps double-check the theory by putting a
> > > WARN_ON(list_empty(active_list) != list_empyt(request_list)) into
> > > gpu_idle? Ofc with this patch reverted so that the bug surfaces again.
> > 
> > [dmesg trace from Chris's previous mail snipped]

Ok, at least we have clear evidence now that the lists indeed seem to get
out of sync.

> Ah, so what it boils down to is that i915_gpu_idle() is a no-op here is
> list_empty(ring->request_list)) [intel_ring_idle:2176].
> 
> Missing link discovered, I think the bug fixed by the patch is indeed
> the same one that triggered the first WARN.

But if we do that short-circuiting in ring_idle then all the requests
_should_ be completed. Which means retire_requests_ring should move all
buffers to the inactive list, even when we do that before retiring
requests.

I'm still baffled and don't really understand what's going on here ...
-Daniel
Chris Wilson March 20, 2015, 2:45 p.m. UTC | #8
On Fri, Mar 20, 2015 at 03:32:52PM +0100, Daniel Vetter wrote:
> But if we do that short-circuiting in ring_idle the all the requests
> _should_ be completed. Which meanse retire_request_ring should move all
> buffers to the inactive list, even when we do that before retiring
> requests.

We test for the requests to be retired after we test for the buffers to
be retired. It is very easy then for us to have active buffers as the
seqno advanced after the buffer retirement and before the requests. That
is (one of) the reasons why we previously sampled seqno only once when
retiring buffers + requests.
-Chris
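
For reference, a simplified userspace model of the single-snapshot behaviour Chris refers to (not the real driver code; seqno_passed is a stand-in for i915_seqno_passed). Both walks compare against the same snapshot, so they cannot reach different conclusions even though the hardware seqno advances in between:

#include <stdbool.h>
#include <stdio.h>

static unsigned int hw_seqno = 1;	/* what the GPU reports */

static bool seqno_passed(unsigned int snapshot, unsigned int seqno)
{
	return snapshot >= seqno;
}

int main(void)
{
	unsigned int req_seqno = 2;
	unsigned int snapshot = hw_seqno;	/* sampled exactly once */
	bool obj_active = true, req_retired = false;

	if (seqno_passed(snapshot, req_seqno))	/* buffer walk */
		obj_active = false;

	hw_seqno = 2;				/* advancing now has no effect */

	if (seqno_passed(snapshot, req_seqno))	/* request walk */
		req_retired = true;

	/* prints "active: 1, retired: 0" - consistent either way */
	printf("active: %d, retired: %d\n", obj_active, req_retired);
	return 0;
}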
Daniel Vetter March 20, 2015, 3 p.m. UTC | #9
On Fri, Mar 20, 2015 at 02:45:04PM +0000, Chris Wilson wrote:
> On Fri, Mar 20, 2015 at 03:32:52PM +0100, Daniel Vetter wrote:
> > But if we do that short-circuiting in ring_idle the all the requests
> > _should_ be completed. Which meanse retire_request_ring should move all
> > buffers to the inactive list, even when we do that before retiring
> > requests.
> 
> We test for the requests to be retired after we test for the buffers to
> be retired. It is very easy then for us to have active buffers as the
> seqno advanced after the buffer retirement and before the requests. That
> is (one of) the reasons why we previously sampled seqno only once when
> retiring buffers + requests.

Yeah I get that part of the race. But before we retire anything in these
callsites we call gpu_idle. And that waits for everything to complete,
except when there are no outstanding requests (i.e. ->request_list is
empty). So either
- ->request_list is empty in ring_idle, which means all requests should
  have completed.  Even if there are some lingering active buffers still
  around we should clean them up.
- ->request_list is not empty, in which case we do a full wait for the
  most recent request. Again all requests should have completed and we
  should be able to clean out both request and active lists.

I do see how we can get out of the retire_request functions with the request
list empty but active buffers still around. But I don't understand how that's
possible with a gpu_idle in front. And thus far all traces are from places
where we do call gpu_idle first.

Or am I missing something?
-Daniel
Chris Wilson March 20, 2015, 3:04 p.m. UTC | #10
On Fri, Mar 20, 2015 at 04:00:50PM +0100, Daniel Vetter wrote:
> On Fri, Mar 20, 2015 at 02:45:04PM +0000, Chris Wilson wrote:
> > On Fri, Mar 20, 2015 at 03:32:52PM +0100, Daniel Vetter wrote:
> > > But if we do that short-circuiting in ring_idle the all the requests
> > > _should_ be completed. Which meanse retire_request_ring should move all
> > > buffers to the inactive list, even when we do that before retiring
> > > requests.
> > 
> > We test for the requests to be retired after we test for the buffers to
> > be retired. It is very easy then for us to have active buffers as the
> > seqno advanced after the buffer retirement and before the requests. That
> > is (one of) the reasons why we previously sampled seqno only once when
> > retiring buffers + requests.
> 
> Yeah I get that part of the race. But before we retire anything in these
> callsites we call gpu_idle. And that waits for everything to complete,
> except whent there are not outstanding requests (i.e. ->request_list is
> empyt). So either
> - ->request_list is empty in ring_idle, which means all requests should
>   have completed.  Even if there are some lingering active buffers still
>   around we should clean them up.
> - ->request_list is not empty, in which case we do a full wait for the
>   most recent request. Again all requests should have completed and we
>   should be able to clean out both request and active lists.
> 
> I do see how we can get out of the retire_request functions with requests
> empty but still active buffers around. But I don't understand how that's
> possible with a gpu_idle in front. And thus far all traces are from places
> where we do call gpu_idle first.
> 
> Or am I missing something?

The retire comes before the gpu_idle (we retire often as part of busy,
execbuffer, timers etc). The traces show exactly that.
-Chris
Daniel Vetter March 20, 2015, 3:33 p.m. UTC | #11
On Fri, Mar 20, 2015 at 03:04:39PM +0000, Chris Wilson wrote:
> On Fri, Mar 20, 2015 at 04:00:50PM +0100, Daniel Vetter wrote:
> > On Fri, Mar 20, 2015 at 02:45:04PM +0000, Chris Wilson wrote:
> > > On Fri, Mar 20, 2015 at 03:32:52PM +0100, Daniel Vetter wrote:
> > > > But if we do that short-circuiting in ring_idle the all the requests
> > > > _should_ be completed. Which meanse retire_request_ring should move all
> > > > buffers to the inactive list, even when we do that before retiring
> > > > requests.
> > > 
> > > We test for the requests to be retired after we test for the buffers to
> > > be retired. It is very easy then for us to have active buffers as the
> > > seqno advanced after the buffer retirement and before the requests. That
> > > is (one of) the reasons why we previously sampled seqno only once when
> > > retiring buffers + requests.
> > 
> > Yeah I get that part of the race. But before we retire anything in these
> > callsites we call gpu_idle. And that waits for everything to complete,
> > except whent there are not outstanding requests (i.e. ->request_list is
> > empyt). So either
> > - ->request_list is empty in ring_idle, which means all requests should
> >   have completed.  Even if there are some lingering active buffers still
> >   around we should clean them up.
> > - ->request_list is not empty, in which case we do a full wait for the
> >   most recent request. Again all requests should have completed and we
> >   should be able to clean out both request and active lists.
> > 
> > I do see how we can get out of the retire_request functions with requests
> > empty but still active buffers around. But I don't understand how that's
> > possible with a gpu_idle in front. And thus far all traces are from places
> > where we do call gpu_idle first.
> > 
> > Or am I missing something?
> 
> The retire comes before the before the gpu_idle (we retire often as a
> part of busy, execbuffer, timers etc). The traces show exactly that.

Yeah, the sequence I see is:
1. retire requests leaves active objects behind with all requests retired.
2. evict_vm
|-> 2a. gpu_idle
|-> 2b. retire_requests
|-> 2c. WARN_ON in i915_gem_evict_vm

I agree with you that before the call to evict_vm the lists are
inconsistent. What I don't understand is how that inconsistency can get past
the 2a/2b double-punch.

Or do I have the wrong sequence in mind?
-Daniel
Chris Wilson March 20, 2015, 3:36 p.m. UTC | #12
On Fri, Mar 20, 2015 at 04:33:08PM +0100, Daniel Vetter wrote:
> On Fri, Mar 20, 2015 at 03:04:39PM +0000, Chris Wilson wrote:
> > The retire comes before the before the gpu_idle (we retire often as a
> > part of busy, execbuffer, timers etc). The traces show exactly that.
> 
> Yeah, the sequence I see is:
> 1. retire requests leaves active objects behind with all requests retired.
> 2. evict_vim
> |-> 2a. gpu_idle
> |-> 2b. retire_requests
> |-> 2c. WARN_ON(i915_gem_evict_vm);
> 
> I agree with you that before the call to evict_vm the lists are
> inconsistent. What I don't understand how that inconsistency can get past
> the 2a/2b double-punch.

2a/2b are both no-ops in this scenario.
-Chris
Daniel Vetter March 23, 2015, 8:43 a.m. UTC | #13
On Fri, Mar 20, 2015 at 03:36:11PM +0000, Chris Wilson wrote:
> On Fri, Mar 20, 2015 at 04:33:08PM +0100, Daniel Vetter wrote:
> > On Fri, Mar 20, 2015 at 03:04:39PM +0000, Chris Wilson wrote:
> > > The retire comes before the before the gpu_idle (we retire often as a
> > > part of busy, execbuffer, timers etc). The traces show exactly that.
> > 
> > Yeah, the sequence I see is:
> > 1. retire requests leaves active objects behind with all requests retired.
> > 2. evict_vim
> > |-> 2a. gpu_idle
> > |-> 2b. retire_requests
> > |-> 2c. WARN_ON(i915_gem_evict_vm);
> > 
> > I agree with you that before the call to evict_vm the lists are
> > inconsistent. What I don't understand how that inconsistency can get past
> > the 2a/2b double-punch.
> 
> 2a/2b are both no-ops in this scenario.

Lifted blindfolds, finally found the short-circuit

if (list_empty(&ring->request_list))
	return;

at the top of i915_gem_retire_requests_ring.

Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>

and for -fixes since the offending patch is in 4.0-rc1.
-Daniel
Daniel Vetter March 23, 2015, 8:49 a.m. UTC | #14
On Wed, Mar 18, 2015 at 06:19:22PM +0000, Chris Wilson wrote:
> If we retire requests last, we may use a later seqno and so clear
> the requests lists without clearing the active list, leading to
> confusion. Hence we should retire requests first for consistency with
> the early return. The order used to be important as the lifecycle for
> the object on the active list was determined by request->seqno. However,
> the requests themselves are now reference counted removing the
> constraint from the order of retirement.
> 
> Fixes regression from
> 
> commit 1b5a433a4dd967b125131da42b89b5cc0d5b1f57
> Author: John Harrison <John.C.Harrison@Intel.com>
> Date:   Mon Nov 24 18:49:42 2014 +0000
> 
>     drm/i915: Convert 'i915_seqno_passed' calls into 'i915_gem_request_completed
> '
> 
> and a
> 
> 	WARNING: CPU: 0 PID: 1383 at drivers/gpu/drm/i915/i915_gem_evict.c:279 i915_gem_evict_vm+0x10c/0x140()
> 	WARN_ON(!list_empty(&vm->active_list))
> 
> Identified by updating WATCH_LISTS:
> 
> 	[drm:i915_verify_lists] *ERROR* blitter ring: active list not empty, but no requests
> 	WARNING: CPU: 0 PID: 681 at drivers/gpu/drm/i915/i915_gem.c:2751 i915_gem_retire_requests_ring+0x149/0x230()
> 	WARN_ON(i915_verify_lists(ring->dev))
> 
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: John Harrison <John.C.Harrison@Intel.com>
> Cc: Daniel Vetter <daniel.vetter@ffwll.ch>

In case it's buried too much in the thread:

Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>

Addendum for the commit:

"Note that this is only a problem in evict_vm where the following happens
after a retire_request has cleaned out all requests, but not all active
bo:
- intel_ring_idle called from i915_gpu_idle notices that no requests are
  outstanding and immediately returns.
- i915_gem_retire_requests_ring called from i915_gem_retire_requests also
  immediately returns when there's no request, still leaving the bo on the
  active list.
- evict_vm hits the WARN_ON(!list_empty(&vm->active_list)) after evicting
  all active objects that there's still stuff left that shouldn't be
  there."

Chris, is that an accurate enough description for Jani to add to the
patch?
-Daniel
> [quoted patch snipped]
Chris Wilson March 23, 2015, 9:13 a.m. UTC | #15
On Mon, Mar 23, 2015 at 09:49:15AM +0100, Daniel Vetter wrote:
> In case it's burried too much in the thread:
> 
> Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>
> 
> Addadendum for the commit:
> 
> "Note that this is only a problem in evict_vm where the following happens
> after a retire_request has cleaned out all requests, but not all active
> bo:
> - intel_ring_idle called from i915_gpu_idle notices that no requests are
>   outstanding and immediately returns.
> - i915_gem_retire_requests_ring called from i915_gem_retire_requests also
>   immediately returns when there's no request, still leaving the bo on the
>   active list.
> - evict_vm hits the WARN_ON(!list_empty(&vm->active_list)) after evicting
>   all active objects that there's still stuff left that shouldn't be
>   there."
> 
> Chris, is that an accurate enough description for Jani to add to the
> patch?

Not quite. It affects everywhere we rely on i915_gpu_idle() idling
the gpu (e.g. suspend); it is only the canary in i915_gem_evict_vm()
that first alerted us to the bug. Maybe we have some recent weird bugs in
suspend (or it may be too soon for those reports to start trickling in)?
-Chris
Chris Wilson March 23, 2015, 9:15 a.m. UTC | #16
On Mon, Mar 23, 2015 at 09:13:36AM +0000, Chris Wilson wrote:
> On Mon, Mar 23, 2015 at 09:49:15AM +0100, Daniel Vetter wrote:
> > In case it's burried too much in the thread:
> > 
> > Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>
> > 
> > Addadendum for the commit:
> > 
> > "Note that this is only a problem in evict_vm where the following happens
> > after a retire_request has cleaned out all requests, but not all active
> > bo:
> > - intel_ring_idle called from i915_gpu_idle notices that no requests are
> >   outstanding and immediately returns.
> > - i915_gem_retire_requests_ring called from i915_gem_retire_requests also
> >   immediately returns when there's no request, still leaving the bo on the
> >   active list.
> > - evict_vm hits the WARN_ON(!list_empty(&vm->active_list)) after evicting
> >   all active objects that there's still stuff left that shouldn't be
> >   there."
> > 
> > Chris, is that an accurate enough description for Jani to add to the
> > patch?
> 
> Not quite. It affects everywhere where we rely on i915_gpu_idle() idling
> the gpu (e.g. suspend), it is only the canary in i915_gem_evict_vm()
> that first alerted us to the bug. Maybe we have some recent weird bugs in
> suspend (or it may be too soon for those reports to start trickling in)?

I take that back. The gpu is idle (no more requests). It's just the
bookkeeping that's wrong (and that only affects eviction iirc).

I stand corrected.
-Chris
Daniel Vetter March 23, 2015, 9:40 a.m. UTC | #17
On Mon, Mar 23, 2015 at 09:15:02AM +0000, Chris Wilson wrote:
> On Mon, Mar 23, 2015 at 09:13:36AM +0000, Chris Wilson wrote:
> > On Mon, Mar 23, 2015 at 09:49:15AM +0100, Daniel Vetter wrote:
> > > In case it's burried too much in the thread:
> > > 
> > > Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>
> > > 
> > > Addadendum for the commit:
> > > 
> > > "Note that this is only a problem in evict_vm where the following happens
> > > after a retire_request has cleaned out all requests, but not all active
> > > bo:
> > > - intel_ring_idle called from i915_gpu_idle notices that no requests are
> > >   outstanding and immediately returns.
> > > - i915_gem_retire_requests_ring called from i915_gem_retire_requests also
> > >   immediately returns when there's no request, still leaving the bo on the
> > >   active list.
> > > - evict_vm hits the WARN_ON(!list_empty(&vm->active_list)) after evicting
> > >   all active objects that there's still stuff left that shouldn't be
> > >   there."
> > > 
> > > Chris, is that an accurate enough description for Jani to add to the
> > > patch?
> > 
> > Not quite. It affects everywhere where we rely on i915_gpu_idle() idling
> > the gpu (e.g. suspend), it is only the canary in i915_gem_evict_vm()
> > that first alerted us to the bug. Maybe we have some recent weird bugs in
> > suspend (or it may be too soon for those reports to start trickling in)?
> 
> I take that back. The gpu is idle (no more requests). It's just the
> bookkeeping that's wrong (and that only affects eviction iirc).
> 
> I stand corrected.

Yeah totally missed that tbh. But since we restore ggtt ptes on resume I
think even if we manage to leave a bunch more objects behind we should
be fine. Worst-case there's more target area for stray writes from the
bios on S4 ;-)
-Daniel
Jani Nikula March 25, 2015, 11:43 a.m. UTC | #18
On Mon, 23 Mar 2015, Daniel Vetter <daniel@ffwll.ch> wrote:
> On Wed, Mar 18, 2015 at 06:19:22PM +0000, Chris Wilson wrote:
>> If we retire requests last, we may use a later seqno and so clear
>> the requests lists without clearing the active list, leading to
>> confusion. Hence we should retire requests first for consistency with
>> the early return. The order used to be important as the lifecycle for
>> the object on the active list was determined by request->seqno. However,
>> the requests themselves are now reference counted removing the
>> constraint from the order of retirement.
>> 
>> Fixes regression from
>> 
>> commit 1b5a433a4dd967b125131da42b89b5cc0d5b1f57
>> Author: John Harrison <John.C.Harrison@Intel.com>
>> Date:   Mon Nov 24 18:49:42 2014 +0000
>> 
>>     drm/i915: Convert 'i915_seqno_passed' calls into 'i915_gem_request_completed
>> '
>> 
>> and a
>> 
>> 	WARNING: CPU: 0 PID: 1383 at drivers/gpu/drm/i915/i915_gem_evict.c:279 i915_gem_evict_vm+0x10c/0x140()
>> 	WARN_ON(!list_empty(&vm->active_list))
>> 
>> Identified by updating WATCH_LISTS:
>> 
>> 	[drm:i915_verify_lists] *ERROR* blitter ring: active list not empty, but no requests
>> 	WARNING: CPU: 0 PID: 681 at drivers/gpu/drm/i915/i915_gem.c:2751 i915_gem_retire_requests_ring+0x149/0x230()
>> 	WARN_ON(i915_verify_lists(ring->dev))
>> 
>> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
>> Cc: John Harrison <John.C.Harrison@Intel.com>
>> Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
>
> In case it's burried too much in the thread:
>
> Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>
>
> Addadendum for the commit:
>
> "Note that this is only a problem in evict_vm where the following happens
> after a retire_request has cleaned out all requests, but not all active
> bo:
> - intel_ring_idle called from i915_gpu_idle notices that no requests are
>   outstanding and immediately returns.
> - i915_gem_retire_requests_ring called from i915_gem_retire_requests also
>   immediately returns when there's no request, still leaving the bo on the
>   active list.
> - evict_vm hits the WARN_ON(!list_empty(&vm->active_list)) after evicting
>   all active objects that there's still stuff left that shouldn't be
>   there."

Pushed to drm-intel-fixes with the above note added. Thanks for the
patch and review.

BR,
Jani.


>
> Chris, is that an accurate enough description for Jani to add to the
> patch?
> -Daniel
>> [quoted patch and mailing list footer snipped]

Patch

diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 092f25cfb8d5..7a9589f38bbc 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -2660,24 +2660,11 @@  i915_gem_retire_requests_ring(struct intel_engine_cs *ring)
 
 	WARN_ON(i915_verify_lists(ring->dev));
 
-	/* Move any buffers on the active list that are no longer referenced
-	 * by the ringbuffer to the flushing/inactive lists as appropriate,
-	 * before we free the context associated with the requests.
+	/* Retire requests first as we use it above for the early return.
+	 * If we retire requests last, we may use a later seqno and so clear
+	 * the requests lists without clearing the active list, leading to
+	 * confusion.
 	 */
-	while (!list_empty(&ring->active_list)) {
-		struct drm_i915_gem_object *obj;
-
-		obj = list_first_entry(&ring->active_list,
-				      struct drm_i915_gem_object,
-				      ring_list);
-
-		if (!i915_gem_request_completed(obj->last_read_req, true))
-			break;
-
-		i915_gem_object_move_to_inactive(obj);
-	}
-
-
 	while (!list_empty(&ring->request_list)) {
 		struct drm_i915_gem_request *request;
 
@@ -2700,6 +2687,23 @@  i915_gem_retire_requests_ring(struct intel_engine_cs *ring)
 		i915_gem_free_request(request);
 	}
 
+	/* Move any buffers on the active list that are no longer referenced
+	 * by the ringbuffer to the flushing/inactive lists as appropriate,
+	 * before we free the context associated with the requests.
+	 */
+	while (!list_empty(&ring->active_list)) {
+		struct drm_i915_gem_object *obj;
+
+		obj = list_first_entry(&ring->active_list,
+				      struct drm_i915_gem_object,
+				      ring_list);
+
+		if (!i915_gem_request_completed(obj->last_read_req, true))
+			break;
+
+		i915_gem_object_move_to_inactive(obj);
+	}
+
 	if (unlikely(ring->trace_irq_req &&
 		     i915_gem_request_completed(ring->trace_irq_req, true))) {
 		ring->irq_put(ring);