Message ID | 1377165321-29871-1-git-send-email-daniel.vetter@ffwll.ch (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
On Thu, Aug 22, 2013 at 11:55:21AM +0200, Daniel Vetter wrote: > The important bugfix here is that we must not unlink the vma when > we keep it around as a placeholder for the execbuf code. Since then we > won't find it again when execbuf gets interrupt and restarted and > create a 2nd vma. And since the code as-is isn't fit yet to deal with > more than one vma, hilarity ensues. This was the primary distinction between my vma->busy patch, and the patch you actually merged from Chris. AFAICT, you've adopted my vma->busy behavior which I remember being called out as a bug by Chris (and being convinced he was right). So can you explain for my sanity how this differs in behavior from vma->busy? > > Specifically the dma map/unmap of the sg table isn't adjusted for > multiple vmas yet and will blow up like this: > > Aug 15 20:12:37 x-ivb9 kernel: [ 8081.750976] BUG: unable to handle kernel NULL pointer dereference at 0000000000000008 > Aug 15 20:12:37 x-ivb9 kernel: [ 8081.751021] IP: [<ffffffffa008fb37>] i915_gem_gtt_finish_object+0x73/0xc8 [i915] > Aug 15 20:12:37 x-ivb9 kernel: [ 8081.751073] PGD 56bb5067 PUD ad3dd067 PMD 0 > Aug 15 20:12:37 x-ivb9 kernel: [ 8081.751096] Oops: 0000 [#1] SMP > Aug 15 20:12:37 x-ivb9 kernel: [ 8081.751114] Modules linked in: tcp_lp ppdev parport_pc lp parport ipv6 dm_mod dcdbas snd_hda_codec_hdmi pcspkr snd_hda_codec_realtek serio_raw i2c_i801 iTCO_wdt iTCO_vendor_support snd_hda_intel snd_hda_codec lpc_ich snd_hwdep mfd_core snd_pcm snd_page_alloc snd_timer snd soundcore acpi_cpufreq i915 video button drm_kms_helper drm mperf freq_table > Aug 15 20:12:37 x-ivb9 kernel: [ 8081.751291] CPU: 1 PID: 16650 Comm: fbo-maxsize Not tainted 3.11.0-rc4_nightlytop_d93f59_debug_20130814_+ #6957 > Aug 15 20:12:37 x-ivb9 kernel: [ 8081.751334] Hardware name: Dell Inc. OptiPlex 9010/03JR84, BIOS A01 05/04/2012 > Aug 15 20:12:37 x-ivb9 kernel: [ 8081.751365] task: ffff8800563b3f00 ti: ffff88004bdf4000 task.ti: ffff88004bdf4000 > Aug 15 20:12:37 x-ivb9 kernel: [ 8081.751396] RIP: 0010:[<ffffffffa008fb37>] [<ffffffffa008fb37>] i915_gem_gtt_finish_object+0x73/0xc8 [i915] > Aug 15 20:12:37 x-ivb9 kernel: [ 8081.751446] RSP: 0018:ffff88004bdf5958 EFLAGS: 00010246 > Aug 15 20:12:37 x-ivb9 kernel: [ 8081.751469] RAX: 0000000000000000 RBX: ffff8801135e0000 RCX: ffff8800ad3bf8e0 > Aug 15 20:12:37 x-ivb9 kernel: [ 8081.751499] RDX: ffff8800ad3bf8e0 RSI: 0000000000000000 RDI: ffff8801007ee780 > Aug 15 20:12:37 x-ivb9 kernel: [ 8081.751529] RBP: ffff88004bdf5978 R08: ffff8800ad3bf8e0 R09: 0000000000000000 > Aug 15 20:12:37 x-ivb9 kernel: [ 8081.751559] R10: ffffffff86ca1810 R11: ffff880036a17101 R12: ffff8801007ee780 > Aug 15 20:12:37 x-ivb9 kernel: [ 8081.751589] R13: 0000000000018001 R14: ffff880118c4e000 R15: ffff8801007ee780 > Aug 15 20:12:37 x-ivb9 kernel: [ 8081.751619] FS: 00007f401a0ce740(0000) GS:ffff88011e280000(0000) knlGS:0000000000000000 > Aug 15 20:12:37 x-ivb9 kernel: [ 8081.751653] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > Aug 15 20:12:37 x-ivb9 kernel: [ 8081.751677] CR2: 0000000000000008 CR3: 000000005635c000 CR4: 00000000001407e0 > Aug 15 20:12:37 x-ivb9 kernel: [ 8081.751707] Stack: > Aug 15 20:12:37 x-ivb9 kernel: [ 8081.751717] ffff8801007ee780 ffff88005c253180 0000000000018000 ffff8801135e0000 > Aug 15 20:12:37 x-ivb9 kernel: [ 8081.751752] ffff88004bdf59a8 ffffffffa0088e55 0000000000000011 ffff8801007eec00 > Aug 15 20:12:37 x-ivb9 kernel: [ 8081.751788] 0000000000018000 ffff880036a17101 ffff88004bdf5a08 ffffffffa0089026 > Aug 15 20:12:37 x-ivb9 kernel: [ 8081.751823] Call Trace: > Aug 15 20:12:37 x-ivb9 kernel: [ 8081.751843] [<ffffffffa0088e55>] i915_vma_unbind+0xdf/0x1ab [i915] > Aug 15 20:12:37 x-ivb9 kernel: [ 8081.751875] [<ffffffffa0089026>] __i915_gem_shrink+0x105/0x177 [i915] > Aug 15 20:12:37 x-ivb9 kernel: [ 8081.751909] [<ffffffffa0089452>] i915_gem_object_get_pages_gtt+0x108/0x309 [i915] > Aug 15 20:12:37 x-ivb9 kernel: [ 8081.751947] [<ffffffffa0085ba9>] i915_gem_object_get_pages+0x61/0x90 [i915] > Aug 15 20:12:37 x-ivb9 kernel: [ 8081.751983] [<ffffffffa008f22b>] ? gen6_ppgtt_insert_entries+0x103/0x125 [i915] > Aug 15 20:12:37 x-ivb9 kernel: [ 8081.752019] [<ffffffffa008a113>] i915_gem_object_pin+0x1fa/0x5df [i915] > Aug 15 20:12:37 x-ivb9 kernel: [ 8081.752054] [<ffffffffa008cdfe>] i915_gem_execbuffer_reserve_object.isra.6+0x8d/0x1bc [i915] > Aug 15 20:12:37 x-ivb9 kernel: [ 8081.752094] [<ffffffffa008d156>] i915_gem_execbuffer_reserve+0x229/0x367 [i915] > Aug 15 20:12:37 x-ivb9 kernel: [ 8081.752131] [<ffffffffa008dbf6>] i915_gem_do_execbuffer.isra.12+0x4dc/0xf3a [i915] > Aug 15 20:12:37 x-ivb9 kernel: [ 8081.752166] [<ffffffff810fc823>] ? might_fault+0x40/0x90 > Aug 15 20:12:37 x-ivb9 kernel: [ 8081.752195] [<ffffffffa008eb89>] i915_gem_execbuffer2+0x187/0x222 [i915] > Aug 15 20:12:37 x-ivb9 kernel: [ 8081.752231] [<ffffffffa000971c>] drm_ioctl+0x308/0x442 [drm] > Aug 15 20:12:37 x-ivb9 kernel: [ 8081.752261] [<ffffffffa008ea02>] ? i915_gem_execbuffer+0x3ae/0x3ae [i915] > Aug 15 20:12:37 x-ivb9 kernel: [ 8081.752292] [<ffffffff817db156>] ? __do_page_fault+0x3dd/0x481 > Aug 15 20:12:37 x-ivb9 kernel: [ 8081.752318] [<ffffffff8112fdba>] vfs_ioctl+0x26/0x39 > Aug 15 20:12:37 x-ivb9 kernel: [ 8081.752341] [<ffffffff811306a2>] do_vfs_ioctl+0x40e/0x451 > Aug 15 20:12:37 x-ivb9 kernel: [ 8081.752366] [<ffffffff817deda7>] ? sysret_check+0x1b/0x56 > Aug 15 20:12:37 x-ivb9 kernel: [ 8081.752391] [<ffffffff8113073c>] SyS_ioctl+0x57/0x87 > Aug 15 20:12:37 x-ivb9 kernel: [ 8081.752415] [<ffffffff8135bbfe>] ? trace_hardirqs_on_thunk+0x3a/0x3f > Aug 15 20:12:37 x-ivb9 kernel: [ 8081.752443] [<ffffffff817ded82>] system_call_fastpath+0x16/0x1b > Aug 15 20:12:37 x-ivb9 kernel: [ 8081.752469] Code: 48 c7 c6 84 30 0e a0 31 c0 e8 d0 e9 f7 ff bf c6 a7 00 00 e8 07 af 2c e1 41 f6 84 24 03 01 00 00 10 75 44 49 8b 84 24 08 01 00 00 <8b> 50 08 48 8b 30 49 8b 86 b0 04 00 00 48 89 c7 48 81 c7 98 00 > Aug 15 20:12:37 x-ivb9 kernel: [ 8081.752624] RIP [<ffffffffa008fb37>] i915_gem_gtt_finish_object+0x73/0xc8 [i915] > Aug 15 20:12:37 x-ivb9 kernel: [ 8081.753750] RSP <ffff88004bdf5958> > Aug 15 20:12:37 x-ivb9 kernel: [ 8081.754851] CR2: 0000000000000008 > > The second thing to correct is the "only one vma for now" check in > vma_unbind - since vma_destroy isn't always called the obj->vma_list > might not be empty. Instead check that the vma list is singular at the > beginning of vma_unbind. This is also more symmetric with bind_to_vm. It's not a "correction" exactly. It's to address the fact that the patch makes the destroy conditional. The old assertion was correct. > > Our first attempting at fixing this was > > commit 1be81a2f2cfd8789a627401d470423358fba2d76 > Author: Chris Wilson <chris@chris-wilson.co.uk> > Date: Tue Aug 20 12:56:40 2013 +0100 > > drm/i915: Don't destroy the vma placeholder during execbuffer reservation > > Cc: Chris Wilson <chris@chris-wilson.co.uk> > Cc: Ben Widawsky <ben@bwidawsk.net> > Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=68298 > Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=68171 > Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch> It's not clear from the commit message if this actually fixes the bug for you (which you seemed to be able to reproduce). Did it? > --- > drivers/gpu/drm/i915/i915_gem.c | 15 +++++++-------- > 1 file changed, 7 insertions(+), 8 deletions(-) > > diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c > index ef92c69..3e855ff 100644 > --- a/drivers/gpu/drm/i915/i915_gem.c > +++ b/drivers/gpu/drm/i915/i915_gem.c > @@ -2616,6 +2616,9 @@ int i915_vma_unbind(struct i915_vma *vma) > drm_i915_private_t *dev_priv = obj->base.dev->dev_private; > int ret; > > + /* For now we only ever use 1 vma per object */ > + WARN_ON(!list_is_singular(&obj->vma_list)); > + > if (list_empty(&vma->vma_link)) > return 0; > > @@ -2661,12 +2664,12 @@ int i915_vma_unbind(struct i915_vma *vma) > drm_mm_remove_node(&vma->node); > > destroy: > - i915_gem_vma_destroy(vma); > + /* Keep the vma as a placeholder in the execbuffer reservation lists */ > + if (!list_empty(&vma->exec_list)) > + i915_gem_vma_destroy(vma); Please see my vma->busy question at the top. To me this looks like the behavior I had with vma->busy where you don't unlink from the vma_list (which actually defies the original convention I had about the vma_list, whereby a vma exists on a vma list IFF it has space allocated in some address space). I think this is necessary though, as there are really three states: 1. Bound 2. Temporarily unbound <-- this is the problematic one 3. Unbound > > /* Since the unbound list is global, only move to that list if > - * no more VMAs exist. > - * NB: Until we have real VMAs there will only ever be one */ > - WARN_ON(!list_empty(&obj->vma_list)); > + * no more VMAs exist. */ > if (list_empty(&obj->vma_list)) > list_move_tail(&obj->global_list, &dev_priv->mm.unbound_list); > > @@ -4171,10 +4174,6 @@ void i915_gem_vma_destroy(struct i915_vma *vma) > WARN_ON(vma->node.allocated); > list_del(&vma->vma_link); > > - /* Keep the vma as a placeholder in the execbuffer reservation lists */ > - if (!list_empty(&vma->exec_list)) > - return; > - > kfree(vma); > } If you really wan to remove this hunk, I think you should do it as a revert (or remove it from the history completely if it isn't too late). Instead though I think turning this into a WARN is still a valid thing to do. Unless I've missed something. It shouldn't ever be safe to destroy a VMA on an exec_list because we use the same list element in eviction and if we accidentally destroy in eviction (called from execbuf) we're in pretty big trouble. Finally, I still think Chris' original suggestion of creating a new member for eviction would make a lot of these bugs at least a little bit easier to find. It's always hard to say though until you actually do it.
On Thu, Aug 22, 2013 at 6:19 PM, Ben Widawsky <ben@bwidawsk.net> wrote: > On Thu, Aug 22, 2013 at 11:55:21AM +0200, Daniel Vetter wrote: >> The important bugfix here is that we must not unlink the vma when >> we keep it around as a placeholder for the execbuf code. Since then we >> won't find it again when execbuf gets interrupt and restarted and >> create a 2nd vma. And since the code as-is isn't fit yet to deal with >> more than one vma, hilarity ensues. > > This was the primary distinction between my vma->busy patch, and the > patch you actually merged from Chris. AFAICT, you've adopted my > vma->busy behavior which I remember being called out as a bug by Chris > (and being convinced he was right). > > So can you explain for my sanity how this differs in behavior from > vma->busy? It should now match in behaviour with your busy patches I think. I still believe that the more explicit vma->exec_list check is clearer and introduces less redundancy. >> Specifically the dma map/unmap of the sg table isn't adjusted for >> multiple vmas yet and will blow up like this: >> >> Aug 15 20:12:37 x-ivb9 kernel: [ 8081.750976] BUG: unable to handle kernel NULL pointer dereference at 0000000000000008 >> Aug 15 20:12:37 x-ivb9 kernel: [ 8081.751021] IP: [<ffffffffa008fb37>] i915_gem_gtt_finish_object+0x73/0xc8 [i915] >> Aug 15 20:12:37 x-ivb9 kernel: [ 8081.751073] PGD 56bb5067 PUD ad3dd067 PMD 0 >> Aug 15 20:12:37 x-ivb9 kernel: [ 8081.751096] Oops: 0000 [#1] SMP >> Aug 15 20:12:37 x-ivb9 kernel: [ 8081.751114] Modules linked in: tcp_lp ppdev parport_pc lp parport ipv6 dm_mod dcdbas snd_hda_codec_hdmi pcspkr snd_hda_codec_realtek serio_raw i2c_i801 iTCO_wdt iTCO_vendor_support snd_hda_intel snd_hda_codec lpc_ich snd_hwdep mfd_core snd_pcm snd_page_alloc snd_timer snd soundcore acpi_cpufreq i915 video button drm_kms_helper drm mperf freq_table >> Aug 15 20:12:37 x-ivb9 kernel: [ 8081.751291] CPU: 1 PID: 16650 Comm: fbo-maxsize Not tainted 3.11.0-rc4_nightlytop_d93f59_debug_20130814_+ #6957 >> Aug 15 20:12:37 x-ivb9 kernel: [ 8081.751334] Hardware name: Dell Inc. OptiPlex 9010/03JR84, BIOS A01 05/04/2012 >> Aug 15 20:12:37 x-ivb9 kernel: [ 8081.751365] task: ffff8800563b3f00 ti: ffff88004bdf4000 task.ti: ffff88004bdf4000 >> Aug 15 20:12:37 x-ivb9 kernel: [ 8081.751396] RIP: 0010:[<ffffffffa008fb37>] [<ffffffffa008fb37>] i915_gem_gtt_finish_object+0x73/0xc8 [i915] >> Aug 15 20:12:37 x-ivb9 kernel: [ 8081.751446] RSP: 0018:ffff88004bdf5958 EFLAGS: 00010246 >> Aug 15 20:12:37 x-ivb9 kernel: [ 8081.751469] RAX: 0000000000000000 RBX: ffff8801135e0000 RCX: ffff8800ad3bf8e0 >> Aug 15 20:12:37 x-ivb9 kernel: [ 8081.751499] RDX: ffff8800ad3bf8e0 RSI: 0000000000000000 RDI: ffff8801007ee780 >> Aug 15 20:12:37 x-ivb9 kernel: [ 8081.751529] RBP: ffff88004bdf5978 R08: ffff8800ad3bf8e0 R09: 0000000000000000 >> Aug 15 20:12:37 x-ivb9 kernel: [ 8081.751559] R10: ffffffff86ca1810 R11: ffff880036a17101 R12: ffff8801007ee780 >> Aug 15 20:12:37 x-ivb9 kernel: [ 8081.751589] R13: 0000000000018001 R14: ffff880118c4e000 R15: ffff8801007ee780 >> Aug 15 20:12:37 x-ivb9 kernel: [ 8081.751619] FS: 00007f401a0ce740(0000) GS:ffff88011e280000(0000) knlGS:0000000000000000 >> Aug 15 20:12:37 x-ivb9 kernel: [ 8081.751653] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 >> Aug 15 20:12:37 x-ivb9 kernel: [ 8081.751677] CR2: 0000000000000008 CR3: 000000005635c000 CR4: 00000000001407e0 >> Aug 15 20:12:37 x-ivb9 kernel: [ 8081.751707] Stack: >> Aug 15 20:12:37 x-ivb9 kernel: [ 8081.751717] ffff8801007ee780 ffff88005c253180 0000000000018000 ffff8801135e0000 >> Aug 15 20:12:37 x-ivb9 kernel: [ 8081.751752] ffff88004bdf59a8 ffffffffa0088e55 0000000000000011 ffff8801007eec00 >> Aug 15 20:12:37 x-ivb9 kernel: [ 8081.751788] 0000000000018000 ffff880036a17101 ffff88004bdf5a08 ffffffffa0089026 >> Aug 15 20:12:37 x-ivb9 kernel: [ 8081.751823] Call Trace: >> Aug 15 20:12:37 x-ivb9 kernel: [ 8081.751843] [<ffffffffa0088e55>] i915_vma_unbind+0xdf/0x1ab [i915] >> Aug 15 20:12:37 x-ivb9 kernel: [ 8081.751875] [<ffffffffa0089026>] __i915_gem_shrink+0x105/0x177 [i915] >> Aug 15 20:12:37 x-ivb9 kernel: [ 8081.751909] [<ffffffffa0089452>] i915_gem_object_get_pages_gtt+0x108/0x309 [i915] >> Aug 15 20:12:37 x-ivb9 kernel: [ 8081.751947] [<ffffffffa0085ba9>] i915_gem_object_get_pages+0x61/0x90 [i915] >> Aug 15 20:12:37 x-ivb9 kernel: [ 8081.751983] [<ffffffffa008f22b>] ? gen6_ppgtt_insert_entries+0x103/0x125 [i915] >> Aug 15 20:12:37 x-ivb9 kernel: [ 8081.752019] [<ffffffffa008a113>] i915_gem_object_pin+0x1fa/0x5df [i915] >> Aug 15 20:12:37 x-ivb9 kernel: [ 8081.752054] [<ffffffffa008cdfe>] i915_gem_execbuffer_reserve_object.isra.6+0x8d/0x1bc [i915] >> Aug 15 20:12:37 x-ivb9 kernel: [ 8081.752094] [<ffffffffa008d156>] i915_gem_execbuffer_reserve+0x229/0x367 [i915] >> Aug 15 20:12:37 x-ivb9 kernel: [ 8081.752131] [<ffffffffa008dbf6>] i915_gem_do_execbuffer.isra.12+0x4dc/0xf3a [i915] >> Aug 15 20:12:37 x-ivb9 kernel: [ 8081.752166] [<ffffffff810fc823>] ? might_fault+0x40/0x90 >> Aug 15 20:12:37 x-ivb9 kernel: [ 8081.752195] [<ffffffffa008eb89>] i915_gem_execbuffer2+0x187/0x222 [i915] >> Aug 15 20:12:37 x-ivb9 kernel: [ 8081.752231] [<ffffffffa000971c>] drm_ioctl+0x308/0x442 [drm] >> Aug 15 20:12:37 x-ivb9 kernel: [ 8081.752261] [<ffffffffa008ea02>] ? i915_gem_execbuffer+0x3ae/0x3ae [i915] >> Aug 15 20:12:37 x-ivb9 kernel: [ 8081.752292] [<ffffffff817db156>] ? __do_page_fault+0x3dd/0x481 >> Aug 15 20:12:37 x-ivb9 kernel: [ 8081.752318] [<ffffffff8112fdba>] vfs_ioctl+0x26/0x39 >> Aug 15 20:12:37 x-ivb9 kernel: [ 8081.752341] [<ffffffff811306a2>] do_vfs_ioctl+0x40e/0x451 >> Aug 15 20:12:37 x-ivb9 kernel: [ 8081.752366] [<ffffffff817deda7>] ? sysret_check+0x1b/0x56 >> Aug 15 20:12:37 x-ivb9 kernel: [ 8081.752391] [<ffffffff8113073c>] SyS_ioctl+0x57/0x87 >> Aug 15 20:12:37 x-ivb9 kernel: [ 8081.752415] [<ffffffff8135bbfe>] ? trace_hardirqs_on_thunk+0x3a/0x3f >> Aug 15 20:12:37 x-ivb9 kernel: [ 8081.752443] [<ffffffff817ded82>] system_call_fastpath+0x16/0x1b >> Aug 15 20:12:37 x-ivb9 kernel: [ 8081.752469] Code: 48 c7 c6 84 30 0e a0 31 c0 e8 d0 e9 f7 ff bf c6 a7 00 00 e8 07 af 2c e1 41 f6 84 24 03 01 00 00 10 75 44 49 8b 84 24 08 01 00 00 <8b> 50 08 48 8b 30 49 8b 86 b0 04 00 00 48 89 c7 48 81 c7 98 00 >> Aug 15 20:12:37 x-ivb9 kernel: [ 8081.752624] RIP [<ffffffffa008fb37>] i915_gem_gtt_finish_object+0x73/0xc8 [i915] >> Aug 15 20:12:37 x-ivb9 kernel: [ 8081.753750] RSP <ffff88004bdf5958> >> Aug 15 20:12:37 x-ivb9 kernel: [ 8081.754851] CR2: 0000000000000008 >> >> The second thing to correct is the "only one vma for now" check in >> vma_unbind - since vma_destroy isn't always called the obj->vma_list >> might not be empty. Instead check that the vma list is singular at the >> beginning of vma_unbind. This is also more symmetric with bind_to_vm. > > It's not a "correction" exactly. It's to address the fact that the patch > makes the destroy conditional. The old assertion was correct. Yeah, "As a consequence we need to ..." or so is better. >> Our first attempting at fixing this was >> >> commit 1be81a2f2cfd8789a627401d470423358fba2d76 >> Author: Chris Wilson <chris@chris-wilson.co.uk> >> Date: Tue Aug 20 12:56:40 2013 +0100 >> >> drm/i915: Don't destroy the vma placeholder during execbuffer reservation >> >> Cc: Chris Wilson <chris@chris-wilson.co.uk> >> Cc: Ben Widawsky <ben@bwidawsk.net> >> Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=68298 >> Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=68171 >> Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch> > > It's not clear from the commit message if this actually fixes the bug > for you (which you seemed to be able to reproduce). Did it? Nope, just hard looking at the Oopses, haven't yet reproduced the bug here. But now that I have a theory what's busted that should help a bit with writing igts ... >> --- >> drivers/gpu/drm/i915/i915_gem.c | 15 +++++++-------- >> 1 file changed, 7 insertions(+), 8 deletions(-) >> >> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c >> index ef92c69..3e855ff 100644 >> --- a/drivers/gpu/drm/i915/i915_gem.c >> +++ b/drivers/gpu/drm/i915/i915_gem.c >> @@ -2616,6 +2616,9 @@ int i915_vma_unbind(struct i915_vma *vma) >> drm_i915_private_t *dev_priv = obj->base.dev->dev_private; >> int ret; >> >> + /* For now we only ever use 1 vma per object */ >> + WARN_ON(!list_is_singular(&obj->vma_list)); >> + >> if (list_empty(&vma->vma_link)) >> return 0; >> >> @@ -2661,12 +2664,12 @@ int i915_vma_unbind(struct i915_vma *vma) >> drm_mm_remove_node(&vma->node); >> >> destroy: >> - i915_gem_vma_destroy(vma); >> + /* Keep the vma as a placeholder in the execbuffer reservation lists */ >> + if (!list_empty(&vma->exec_list)) >> + i915_gem_vma_destroy(vma); > > Please see my vma->busy question at the top. To me this looks like the > behavior I had with vma->busy where you don't unlink from the vma_list > (which actually defies the original convention I had about the > vma_list, whereby a vma exists on a vma list IFF it has space allocated > in some address space). I think this is necessary though, as there are > really three states: > 1. Bound > 2. Temporarily unbound <-- this is the problematic one > 3. Unbound Yes. > >> >> /* Since the unbound list is global, only move to that list if >> - * no more VMAs exist. >> - * NB: Until we have real VMAs there will only ever be one */ >> - WARN_ON(!list_empty(&obj->vma_list)); >> + * no more VMAs exist. */ >> if (list_empty(&obj->vma_list)) >> list_move_tail(&obj->global_list, &dev_priv->mm.unbound_list); >> >> @@ -4171,10 +4174,6 @@ void i915_gem_vma_destroy(struct i915_vma *vma) >> WARN_ON(vma->node.allocated); >> list_del(&vma->vma_link); >> >> - /* Keep the vma as a placeholder in the execbuffer reservation lists */ >> - if (!list_empty(&vma->exec_list)) >> - return; >> - >> kfree(vma); >> } > > If you really wan to remove this hunk, I think you should do it as a > revert (or remove it from the history completely if it isn't too late). > Instead though I think turning this into a WARN is still a valid thing > to do. Unless I've missed something. It shouldn't ever be safe to > destroy a VMA on an exec_list because we use the same list element in > eviction and if we accidentally destroy in eviction (called from > execbuf) we're in pretty big trouble. I think I'll squash this in when merging with Chris' patch and fixup the commit message a bit. For more paranoia in the evict code Chris suggested to add a check in mark_free for list_empty(vma->exec_list). Would be good in a follow-up patch imo. > Finally, I still think Chris' original suggestion of creating a new > member for eviction would make a lot of these bugs at least a little bit > easier to find. It's always hard to say though until you actually do it. Yeah, we could just go back to the old ways of using the lru lists to unwind the eviction roaster (i.e. active/inactive vm lists now). That's also why I think we could just merge the active/inactive list together into one, since without that merge it'll be trouble. But when Chris merged the fair lru eviction code he frobbed it a bit to use the current temporary list. Adding yet another list is imo wasteful ;-) Cheers, Daniel
On Thu, Aug 22, 2013 at 09:41:07PM +0200, Daniel Vetter wrote: [snip] > > > > It's not clear from the commit message if this actually fixes the bug > > for you (which you seemed to be able to reproduce). Did it? > > Nope, just hard looking at the Oopses, haven't yet reproduced the bug > here. But now that I have a theory what's busted that should help a > bit with writing igts ... > In that case, I think you need the same check in the failure path in bind_to_vm. I have that fixed locally, but haven't actually tested it. > >> --- > >> drivers/gpu/drm/i915/i915_gem.c | 15 +++++++-------- > >> 1 file changed, 7 insertions(+), 8 deletions(-) > >> > >> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c > >> index ef92c69..3e855ff 100644 > >> --- a/drivers/gpu/drm/i915/i915_gem.c > >> +++ b/drivers/gpu/drm/i915/i915_gem.c > >> @@ -2616,6 +2616,9 @@ int i915_vma_unbind(struct i915_vma *vma) > >> drm_i915_private_t *dev_priv = obj->base.dev->dev_private; > >> int ret; > >> > >> + /* For now we only ever use 1 vma per object */ > >> + WARN_ON(!list_is_singular(&obj->vma_list)); > >> + > >> if (list_empty(&vma->vma_link)) > >> return 0; > >> > >> @@ -2661,12 +2664,12 @@ int i915_vma_unbind(struct i915_vma *vma) > >> drm_mm_remove_node(&vma->node); > >> > >> destroy: > >> - i915_gem_vma_destroy(vma); > >> + /* Keep the vma as a placeholder in the execbuffer reservation lists */ > >> + if (!list_empty(&vma->exec_list)) > >> + i915_gem_vma_destroy(vma); > > > > Please see my vma->busy question at the top. To me this looks like the > > behavior I had with vma->busy where you don't unlink from the vma_list > > (which actually defies the original convention I had about the > > vma_list, whereby a vma exists on a vma list IFF it has space allocated > > in some address space). I think this is necessary though, as there are > > really three states: > > 1. Bound > > 2. Temporarily unbound <-- this is the problematic one > > 3. Unbound > > Yes. > > > > >> > >> /* Since the unbound list is global, only move to that list if > >> - * no more VMAs exist. > >> - * NB: Until we have real VMAs there will only ever be one */ > >> - WARN_ON(!list_empty(&obj->vma_list)); > >> + * no more VMAs exist. */ > >> if (list_empty(&obj->vma_list)) > >> list_move_tail(&obj->global_list, &dev_priv->mm.unbound_list); > >> > >> @@ -4171,10 +4174,6 @@ void i915_gem_vma_destroy(struct i915_vma *vma) > >> WARN_ON(vma->node.allocated); > >> list_del(&vma->vma_link); > >> > >> - /* Keep the vma as a placeholder in the execbuffer reservation lists */ > >> - if (!list_empty(&vma->exec_list)) > >> - return; > >> - > >> kfree(vma); > >> } > > > > If you really wan to remove this hunk, I think you should do it as a > > revert (or remove it from the history completely if it isn't too late). > > Instead though I think turning this into a WARN is still a valid thing > > to do. Unless I've missed something. It shouldn't ever be safe to > > destroy a VMA on an exec_list because we use the same list element in > > eviction and if we accidentally destroy in eviction (called from > > execbuf) we're in pretty big trouble. > > I think I'll squash this in when merging with Chris' patch and fixup > the commit message a bit. For more paranoia in the evict code Chris > suggested to add a check in mark_free for list_empty(vma->exec_list). > Would be good in a follow-up patch imo. I have this exact same check in my current code to debug this, I think it makes a lot of sense. > > > Finally, I still think Chris' original suggestion of creating a new > > member for eviction would make a lot of these bugs at least a little bit > > easier to find. It's always hard to say though until you actually do it. > > Yeah, we could just go back to the old ways of using the lru lists to > unwind the eviction roaster (i.e. active/inactive vm lists now). > That's also why I think we could just merge the active/inactive list > together into one, since without that merge it'll be trouble. But when > Chris merged the fair lru eviction code he frobbed it a bit to use the > current temporary list. > > Adding yet another list is imo wasteful ;-) I guess we'll see how difficult the bugs are to fix. It uses extra memory, that much is indisputable.
diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c index ef92c69..3e855ff 100644 --- a/drivers/gpu/drm/i915/i915_gem.c +++ b/drivers/gpu/drm/i915/i915_gem.c @@ -2616,6 +2616,9 @@ int i915_vma_unbind(struct i915_vma *vma) drm_i915_private_t *dev_priv = obj->base.dev->dev_private; int ret; + /* For now we only ever use 1 vma per object */ + WARN_ON(!list_is_singular(&obj->vma_list)); + if (list_empty(&vma->vma_link)) return 0; @@ -2661,12 +2664,12 @@ int i915_vma_unbind(struct i915_vma *vma) drm_mm_remove_node(&vma->node); destroy: - i915_gem_vma_destroy(vma); + /* Keep the vma as a placeholder in the execbuffer reservation lists */ + if (!list_empty(&vma->exec_list)) + i915_gem_vma_destroy(vma); /* Since the unbound list is global, only move to that list if - * no more VMAs exist. - * NB: Until we have real VMAs there will only ever be one */ - WARN_ON(!list_empty(&obj->vma_list)); + * no more VMAs exist. */ if (list_empty(&obj->vma_list)) list_move_tail(&obj->global_list, &dev_priv->mm.unbound_list); @@ -4171,10 +4174,6 @@ void i915_gem_vma_destroy(struct i915_vma *vma) WARN_ON(vma->node.allocated); list_del(&vma->vma_link); - /* Keep the vma as a placeholder in the execbuffer reservation lists */ - if (!list_empty(&vma->exec_list)) - return; - kfree(vma); }
The important bugfix here is that we must not unlink the vma when we keep it around as a placeholder for the execbuf code. Since then we won't find it again when execbuf gets interrupt and restarted and create a 2nd vma. And since the code as-is isn't fit yet to deal with more than one vma, hilarity ensues. Specifically the dma map/unmap of the sg table isn't adjusted for multiple vmas yet and will blow up like this: Aug 15 20:12:37 x-ivb9 kernel: [ 8081.750976] BUG: unable to handle kernel NULL pointer dereference at 0000000000000008 Aug 15 20:12:37 x-ivb9 kernel: [ 8081.751021] IP: [<ffffffffa008fb37>] i915_gem_gtt_finish_object+0x73/0xc8 [i915] Aug 15 20:12:37 x-ivb9 kernel: [ 8081.751073] PGD 56bb5067 PUD ad3dd067 PMD 0 Aug 15 20:12:37 x-ivb9 kernel: [ 8081.751096] Oops: 0000 [#1] SMP Aug 15 20:12:37 x-ivb9 kernel: [ 8081.751114] Modules linked in: tcp_lp ppdev parport_pc lp parport ipv6 dm_mod dcdbas snd_hda_codec_hdmi pcspkr snd_hda_codec_realtek serio_raw i2c_i801 iTCO_wdt iTCO_vendor_support snd_hda_intel snd_hda_codec lpc_ich snd_hwdep mfd_core snd_pcm snd_page_alloc snd_timer snd soundcore acpi_cpufreq i915 video button drm_kms_helper drm mperf freq_table Aug 15 20:12:37 x-ivb9 kernel: [ 8081.751291] CPU: 1 PID: 16650 Comm: fbo-maxsize Not tainted 3.11.0-rc4_nightlytop_d93f59_debug_20130814_+ #6957 Aug 15 20:12:37 x-ivb9 kernel: [ 8081.751334] Hardware name: Dell Inc. OptiPlex 9010/03JR84, BIOS A01 05/04/2012 Aug 15 20:12:37 x-ivb9 kernel: [ 8081.751365] task: ffff8800563b3f00 ti: ffff88004bdf4000 task.ti: ffff88004bdf4000 Aug 15 20:12:37 x-ivb9 kernel: [ 8081.751396] RIP: 0010:[<ffffffffa008fb37>] [<ffffffffa008fb37>] i915_gem_gtt_finish_object+0x73/0xc8 [i915] Aug 15 20:12:37 x-ivb9 kernel: [ 8081.751446] RSP: 0018:ffff88004bdf5958 EFLAGS: 00010246 Aug 15 20:12:37 x-ivb9 kernel: [ 8081.751469] RAX: 0000000000000000 RBX: ffff8801135e0000 RCX: ffff8800ad3bf8e0 Aug 15 20:12:37 x-ivb9 kernel: [ 8081.751499] RDX: ffff8800ad3bf8e0 RSI: 0000000000000000 RDI: ffff8801007ee780 Aug 15 20:12:37 x-ivb9 kernel: [ 8081.751529] RBP: ffff88004bdf5978 R08: ffff8800ad3bf8e0 R09: 0000000000000000 Aug 15 20:12:37 x-ivb9 kernel: [ 8081.751559] R10: ffffffff86ca1810 R11: ffff880036a17101 R12: ffff8801007ee780 Aug 15 20:12:37 x-ivb9 kernel: [ 8081.751589] R13: 0000000000018001 R14: ffff880118c4e000 R15: ffff8801007ee780 Aug 15 20:12:37 x-ivb9 kernel: [ 8081.751619] FS: 00007f401a0ce740(0000) GS:ffff88011e280000(0000) knlGS:0000000000000000 Aug 15 20:12:37 x-ivb9 kernel: [ 8081.751653] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 Aug 15 20:12:37 x-ivb9 kernel: [ 8081.751677] CR2: 0000000000000008 CR3: 000000005635c000 CR4: 00000000001407e0 Aug 15 20:12:37 x-ivb9 kernel: [ 8081.751707] Stack: Aug 15 20:12:37 x-ivb9 kernel: [ 8081.751717] ffff8801007ee780 ffff88005c253180 0000000000018000 ffff8801135e0000 Aug 15 20:12:37 x-ivb9 kernel: [ 8081.751752] ffff88004bdf59a8 ffffffffa0088e55 0000000000000011 ffff8801007eec00 Aug 15 20:12:37 x-ivb9 kernel: [ 8081.751788] 0000000000018000 ffff880036a17101 ffff88004bdf5a08 ffffffffa0089026 Aug 15 20:12:37 x-ivb9 kernel: [ 8081.751823] Call Trace: Aug 15 20:12:37 x-ivb9 kernel: [ 8081.751843] [<ffffffffa0088e55>] i915_vma_unbind+0xdf/0x1ab [i915] Aug 15 20:12:37 x-ivb9 kernel: [ 8081.751875] [<ffffffffa0089026>] __i915_gem_shrink+0x105/0x177 [i915] Aug 15 20:12:37 x-ivb9 kernel: [ 8081.751909] [<ffffffffa0089452>] i915_gem_object_get_pages_gtt+0x108/0x309 [i915] Aug 15 20:12:37 x-ivb9 kernel: [ 8081.751947] [<ffffffffa0085ba9>] i915_gem_object_get_pages+0x61/0x90 [i915] Aug 15 20:12:37 x-ivb9 kernel: [ 8081.751983] [<ffffffffa008f22b>] ? gen6_ppgtt_insert_entries+0x103/0x125 [i915] Aug 15 20:12:37 x-ivb9 kernel: [ 8081.752019] [<ffffffffa008a113>] i915_gem_object_pin+0x1fa/0x5df [i915] Aug 15 20:12:37 x-ivb9 kernel: [ 8081.752054] [<ffffffffa008cdfe>] i915_gem_execbuffer_reserve_object.isra.6+0x8d/0x1bc [i915] Aug 15 20:12:37 x-ivb9 kernel: [ 8081.752094] [<ffffffffa008d156>] i915_gem_execbuffer_reserve+0x229/0x367 [i915] Aug 15 20:12:37 x-ivb9 kernel: [ 8081.752131] [<ffffffffa008dbf6>] i915_gem_do_execbuffer.isra.12+0x4dc/0xf3a [i915] Aug 15 20:12:37 x-ivb9 kernel: [ 8081.752166] [<ffffffff810fc823>] ? might_fault+0x40/0x90 Aug 15 20:12:37 x-ivb9 kernel: [ 8081.752195] [<ffffffffa008eb89>] i915_gem_execbuffer2+0x187/0x222 [i915] Aug 15 20:12:37 x-ivb9 kernel: [ 8081.752231] [<ffffffffa000971c>] drm_ioctl+0x308/0x442 [drm] Aug 15 20:12:37 x-ivb9 kernel: [ 8081.752261] [<ffffffffa008ea02>] ? i915_gem_execbuffer+0x3ae/0x3ae [i915] Aug 15 20:12:37 x-ivb9 kernel: [ 8081.752292] [<ffffffff817db156>] ? __do_page_fault+0x3dd/0x481 Aug 15 20:12:37 x-ivb9 kernel: [ 8081.752318] [<ffffffff8112fdba>] vfs_ioctl+0x26/0x39 Aug 15 20:12:37 x-ivb9 kernel: [ 8081.752341] [<ffffffff811306a2>] do_vfs_ioctl+0x40e/0x451 Aug 15 20:12:37 x-ivb9 kernel: [ 8081.752366] [<ffffffff817deda7>] ? sysret_check+0x1b/0x56 Aug 15 20:12:37 x-ivb9 kernel: [ 8081.752391] [<ffffffff8113073c>] SyS_ioctl+0x57/0x87 Aug 15 20:12:37 x-ivb9 kernel: [ 8081.752415] [<ffffffff8135bbfe>] ? trace_hardirqs_on_thunk+0x3a/0x3f Aug 15 20:12:37 x-ivb9 kernel: [ 8081.752443] [<ffffffff817ded82>] system_call_fastpath+0x16/0x1b Aug 15 20:12:37 x-ivb9 kernel: [ 8081.752469] Code: 48 c7 c6 84 30 0e a0 31 c0 e8 d0 e9 f7 ff bf c6 a7 00 00 e8 07 af 2c e1 41 f6 84 24 03 01 00 00 10 75 44 49 8b 84 24 08 01 00 00 <8b> 50 08 48 8b 30 49 8b 86 b0 04 00 00 48 89 c7 48 81 c7 98 00 Aug 15 20:12:37 x-ivb9 kernel: [ 8081.752624] RIP [<ffffffffa008fb37>] i915_gem_gtt_finish_object+0x73/0xc8 [i915] Aug 15 20:12:37 x-ivb9 kernel: [ 8081.753750] RSP <ffff88004bdf5958> Aug 15 20:12:37 x-ivb9 kernel: [ 8081.754851] CR2: 0000000000000008 The second thing to correct is the "only one vma for now" check in vma_unbind - since vma_destroy isn't always called the obj->vma_list might not be empty. Instead check that the vma list is singular at the beginning of vma_unbind. This is also more symmetric with bind_to_vm. Our first attempting at fixing this was commit 1be81a2f2cfd8789a627401d470423358fba2d76 Author: Chris Wilson <chris@chris-wilson.co.uk> Date: Tue Aug 20 12:56:40 2013 +0100 drm/i915: Don't destroy the vma placeholder during execbuffer reservation Cc: Chris Wilson <chris@chris-wilson.co.uk> Cc: Ben Widawsky <ben@bwidawsk.net> Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=68298 Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=68171 Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch> --- drivers/gpu/drm/i915/i915_gem.c | 15 +++++++-------- 1 file changed, 7 insertions(+), 8 deletions(-)