diff mbox

drm/i915: More vma fixups around unbind/destroy

Message ID 1377165321-29871-1-git-send-email-daniel.vetter@ffwll.ch (mailing list archive)
State New, archived
Headers show

Commit Message

Daniel Vetter Aug. 22, 2013, 9:55 a.m. UTC
The important bugfix here is that we must not unlink the vma when
we keep it around as a placeholder for the execbuf code. Since then we
won't find it again when execbuf gets interrupt and restarted and
create a 2nd vma. And since the code as-is isn't fit yet to deal with
more than one vma, hilarity ensues.

Specifically the dma map/unmap of the sg table isn't adjusted for
multiple vmas yet and will blow up like this:

Aug 15 20:12:37 x-ivb9 kernel: [ 8081.750976] BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
Aug 15 20:12:37 x-ivb9 kernel: [ 8081.751021] IP: [<ffffffffa008fb37>] i915_gem_gtt_finish_object+0x73/0xc8 [i915]
Aug 15 20:12:37 x-ivb9 kernel: [ 8081.751073] PGD 56bb5067 PUD ad3dd067 PMD 0
Aug 15 20:12:37 x-ivb9 kernel: [ 8081.751096] Oops: 0000 [#1] SMP
Aug 15 20:12:37 x-ivb9 kernel: [ 8081.751114] Modules linked in: tcp_lp ppdev parport_pc lp parport ipv6 dm_mod dcdbas snd_hda_codec_hdmi pcspkr snd_hda_codec_realtek serio_raw i2c_i801 iTCO_wdt iTCO_vendor_support snd_hda_intel snd_hda_codec lpc_ich snd_hwdep mfd_core snd_pcm snd_page_alloc snd_timer snd soundcore acpi_cpufreq i915 video button drm_kms_helper drm mperf freq_table
Aug 15 20:12:37 x-ivb9 kernel: [ 8081.751291] CPU: 1 PID: 16650 Comm: fbo-maxsize Not tainted 3.11.0-rc4_nightlytop_d93f59_debug_20130814_+ #6957
Aug 15 20:12:37 x-ivb9 kernel: [ 8081.751334] Hardware name: Dell Inc. OptiPlex 9010/03JR84, BIOS A01 05/04/2012
Aug 15 20:12:37 x-ivb9 kernel: [ 8081.751365] task: ffff8800563b3f00 ti: ffff88004bdf4000 task.ti: ffff88004bdf4000
Aug 15 20:12:37 x-ivb9 kernel: [ 8081.751396] RIP: 0010:[<ffffffffa008fb37>]  [<ffffffffa008fb37>] i915_gem_gtt_finish_object+0x73/0xc8 [i915]
Aug 15 20:12:37 x-ivb9 kernel: [ 8081.751446] RSP: 0018:ffff88004bdf5958  EFLAGS: 00010246
Aug 15 20:12:37 x-ivb9 kernel: [ 8081.751469] RAX: 0000000000000000 RBX: ffff8801135e0000 RCX: ffff8800ad3bf8e0
Aug 15 20:12:37 x-ivb9 kernel: [ 8081.751499] RDX: ffff8800ad3bf8e0 RSI: 0000000000000000 RDI: ffff8801007ee780
Aug 15 20:12:37 x-ivb9 kernel: [ 8081.751529] RBP: ffff88004bdf5978 R08: ffff8800ad3bf8e0 R09: 0000000000000000
Aug 15 20:12:37 x-ivb9 kernel: [ 8081.751559] R10: ffffffff86ca1810 R11: ffff880036a17101 R12: ffff8801007ee780
Aug 15 20:12:37 x-ivb9 kernel: [ 8081.751589] R13: 0000000000018001 R14: ffff880118c4e000 R15: ffff8801007ee780
Aug 15 20:12:37 x-ivb9 kernel: [ 8081.751619] FS:  00007f401a0ce740(0000) GS:ffff88011e280000(0000) knlGS:0000000000000000
Aug 15 20:12:37 x-ivb9 kernel: [ 8081.751653] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Aug 15 20:12:37 x-ivb9 kernel: [ 8081.751677] CR2: 0000000000000008 CR3: 000000005635c000 CR4: 00000000001407e0
Aug 15 20:12:37 x-ivb9 kernel: [ 8081.751707] Stack:
Aug 15 20:12:37 x-ivb9 kernel: [ 8081.751717]  ffff8801007ee780 ffff88005c253180 0000000000018000 ffff8801135e0000
Aug 15 20:12:37 x-ivb9 kernel: [ 8081.751752]  ffff88004bdf59a8 ffffffffa0088e55 0000000000000011 ffff8801007eec00
Aug 15 20:12:37 x-ivb9 kernel: [ 8081.751788]  0000000000018000 ffff880036a17101 ffff88004bdf5a08 ffffffffa0089026
Aug 15 20:12:37 x-ivb9 kernel: [ 8081.751823] Call Trace:
Aug 15 20:12:37 x-ivb9 kernel: [ 8081.751843]  [<ffffffffa0088e55>] i915_vma_unbind+0xdf/0x1ab [i915]
Aug 15 20:12:37 x-ivb9 kernel: [ 8081.751875]  [<ffffffffa0089026>] __i915_gem_shrink+0x105/0x177 [i915]
Aug 15 20:12:37 x-ivb9 kernel: [ 8081.751909]  [<ffffffffa0089452>] i915_gem_object_get_pages_gtt+0x108/0x309 [i915]
Aug 15 20:12:37 x-ivb9 kernel: [ 8081.751947]  [<ffffffffa0085ba9>] i915_gem_object_get_pages+0x61/0x90 [i915]
Aug 15 20:12:37 x-ivb9 kernel: [ 8081.751983]  [<ffffffffa008f22b>] ? gen6_ppgtt_insert_entries+0x103/0x125 [i915]
Aug 15 20:12:37 x-ivb9 kernel: [ 8081.752019]  [<ffffffffa008a113>] i915_gem_object_pin+0x1fa/0x5df [i915]
Aug 15 20:12:37 x-ivb9 kernel: [ 8081.752054]  [<ffffffffa008cdfe>] i915_gem_execbuffer_reserve_object.isra.6+0x8d/0x1bc [i915]
Aug 15 20:12:37 x-ivb9 kernel: [ 8081.752094]  [<ffffffffa008d156>] i915_gem_execbuffer_reserve+0x229/0x367 [i915]
Aug 15 20:12:37 x-ivb9 kernel: [ 8081.752131]  [<ffffffffa008dbf6>] i915_gem_do_execbuffer.isra.12+0x4dc/0xf3a [i915]
Aug 15 20:12:37 x-ivb9 kernel: [ 8081.752166]  [<ffffffff810fc823>] ? might_fault+0x40/0x90
Aug 15 20:12:37 x-ivb9 kernel: [ 8081.752195]  [<ffffffffa008eb89>] i915_gem_execbuffer2+0x187/0x222 [i915]
Aug 15 20:12:37 x-ivb9 kernel: [ 8081.752231]  [<ffffffffa000971c>] drm_ioctl+0x308/0x442 [drm]
Aug 15 20:12:37 x-ivb9 kernel: [ 8081.752261]  [<ffffffffa008ea02>] ? i915_gem_execbuffer+0x3ae/0x3ae [i915]
Aug 15 20:12:37 x-ivb9 kernel: [ 8081.752292]  [<ffffffff817db156>] ? __do_page_fault+0x3dd/0x481
Aug 15 20:12:37 x-ivb9 kernel: [ 8081.752318]  [<ffffffff8112fdba>] vfs_ioctl+0x26/0x39
Aug 15 20:12:37 x-ivb9 kernel: [ 8081.752341]  [<ffffffff811306a2>] do_vfs_ioctl+0x40e/0x451
Aug 15 20:12:37 x-ivb9 kernel: [ 8081.752366]  [<ffffffff817deda7>] ? sysret_check+0x1b/0x56
Aug 15 20:12:37 x-ivb9 kernel: [ 8081.752391]  [<ffffffff8113073c>] SyS_ioctl+0x57/0x87
Aug 15 20:12:37 x-ivb9 kernel: [ 8081.752415]  [<ffffffff8135bbfe>] ? trace_hardirqs_on_thunk+0x3a/0x3f
Aug 15 20:12:37 x-ivb9 kernel: [ 8081.752443]  [<ffffffff817ded82>] system_call_fastpath+0x16/0x1b
Aug 15 20:12:37 x-ivb9 kernel: [ 8081.752469] Code: 48 c7 c6 84 30 0e a0 31 c0 e8 d0 e9 f7 ff bf c6 a7 00 00 e8 07 af 2c e1 41 f6 84 24 03 01 00 00 10 75 44 49 8b 84 24 08 01 00 00 <8b> 50 08 48 8b 30 49 8b 86 b0 04 00 00 48 89 c7 48 81 c7 98 00
Aug 15 20:12:37 x-ivb9 kernel: [ 8081.752624] RIP  [<ffffffffa008fb37>] i915_gem_gtt_finish_object+0x73/0xc8 [i915]
Aug 15 20:12:37 x-ivb9 kernel: [ 8081.753750]  RSP <ffff88004bdf5958>
Aug 15 20:12:37 x-ivb9 kernel: [ 8081.754851] CR2: 0000000000000008

The second thing to correct is the "only one vma for now" check in
vma_unbind - since vma_destroy isn't always called the obj->vma_list
might not be empty. Instead check that the vma list is singular at the
beginning of vma_unbind. This is also more symmetric with bind_to_vm.

Our first attempting at fixing this was

commit 1be81a2f2cfd8789a627401d470423358fba2d76
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Tue Aug 20 12:56:40 2013 +0100

    drm/i915: Don't destroy the vma placeholder during execbuffer reservation

Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Ben Widawsky <ben@bwidawsk.net>
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=68298
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=68171
Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
---
 drivers/gpu/drm/i915/i915_gem.c | 15 +++++++--------
 1 file changed, 7 insertions(+), 8 deletions(-)

Comments

Ben Widawsky Aug. 22, 2013, 4:19 p.m. UTC | #1
On Thu, Aug 22, 2013 at 11:55:21AM +0200, Daniel Vetter wrote:
> The important bugfix here is that we must not unlink the vma when
> we keep it around as a placeholder for the execbuf code. Since then we
> won't find it again when execbuf gets interrupt and restarted and
> create a 2nd vma. And since the code as-is isn't fit yet to deal with
> more than one vma, hilarity ensues.

This was the primary distinction between my vma->busy patch, and the
patch you actually merged from Chris. AFAICT, you've adopted my
vma->busy behavior which I remember being called out as a bug by Chris
(and being convinced he was right).

So can you explain for my sanity how this differs in behavior from
vma->busy?

> 
> Specifically the dma map/unmap of the sg table isn't adjusted for
> multiple vmas yet and will blow up like this:
> 
> Aug 15 20:12:37 x-ivb9 kernel: [ 8081.750976] BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
> Aug 15 20:12:37 x-ivb9 kernel: [ 8081.751021] IP: [<ffffffffa008fb37>] i915_gem_gtt_finish_object+0x73/0xc8 [i915]
> Aug 15 20:12:37 x-ivb9 kernel: [ 8081.751073] PGD 56bb5067 PUD ad3dd067 PMD 0
> Aug 15 20:12:37 x-ivb9 kernel: [ 8081.751096] Oops: 0000 [#1] SMP
> Aug 15 20:12:37 x-ivb9 kernel: [ 8081.751114] Modules linked in: tcp_lp ppdev parport_pc lp parport ipv6 dm_mod dcdbas snd_hda_codec_hdmi pcspkr snd_hda_codec_realtek serio_raw i2c_i801 iTCO_wdt iTCO_vendor_support snd_hda_intel snd_hda_codec lpc_ich snd_hwdep mfd_core snd_pcm snd_page_alloc snd_timer snd soundcore acpi_cpufreq i915 video button drm_kms_helper drm mperf freq_table
> Aug 15 20:12:37 x-ivb9 kernel: [ 8081.751291] CPU: 1 PID: 16650 Comm: fbo-maxsize Not tainted 3.11.0-rc4_nightlytop_d93f59_debug_20130814_+ #6957
> Aug 15 20:12:37 x-ivb9 kernel: [ 8081.751334] Hardware name: Dell Inc. OptiPlex 9010/03JR84, BIOS A01 05/04/2012
> Aug 15 20:12:37 x-ivb9 kernel: [ 8081.751365] task: ffff8800563b3f00 ti: ffff88004bdf4000 task.ti: ffff88004bdf4000
> Aug 15 20:12:37 x-ivb9 kernel: [ 8081.751396] RIP: 0010:[<ffffffffa008fb37>]  [<ffffffffa008fb37>] i915_gem_gtt_finish_object+0x73/0xc8 [i915]
> Aug 15 20:12:37 x-ivb9 kernel: [ 8081.751446] RSP: 0018:ffff88004bdf5958  EFLAGS: 00010246
> Aug 15 20:12:37 x-ivb9 kernel: [ 8081.751469] RAX: 0000000000000000 RBX: ffff8801135e0000 RCX: ffff8800ad3bf8e0
> Aug 15 20:12:37 x-ivb9 kernel: [ 8081.751499] RDX: ffff8800ad3bf8e0 RSI: 0000000000000000 RDI: ffff8801007ee780
> Aug 15 20:12:37 x-ivb9 kernel: [ 8081.751529] RBP: ffff88004bdf5978 R08: ffff8800ad3bf8e0 R09: 0000000000000000
> Aug 15 20:12:37 x-ivb9 kernel: [ 8081.751559] R10: ffffffff86ca1810 R11: ffff880036a17101 R12: ffff8801007ee780
> Aug 15 20:12:37 x-ivb9 kernel: [ 8081.751589] R13: 0000000000018001 R14: ffff880118c4e000 R15: ffff8801007ee780
> Aug 15 20:12:37 x-ivb9 kernel: [ 8081.751619] FS:  00007f401a0ce740(0000) GS:ffff88011e280000(0000) knlGS:0000000000000000
> Aug 15 20:12:37 x-ivb9 kernel: [ 8081.751653] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> Aug 15 20:12:37 x-ivb9 kernel: [ 8081.751677] CR2: 0000000000000008 CR3: 000000005635c000 CR4: 00000000001407e0
> Aug 15 20:12:37 x-ivb9 kernel: [ 8081.751707] Stack:
> Aug 15 20:12:37 x-ivb9 kernel: [ 8081.751717]  ffff8801007ee780 ffff88005c253180 0000000000018000 ffff8801135e0000
> Aug 15 20:12:37 x-ivb9 kernel: [ 8081.751752]  ffff88004bdf59a8 ffffffffa0088e55 0000000000000011 ffff8801007eec00
> Aug 15 20:12:37 x-ivb9 kernel: [ 8081.751788]  0000000000018000 ffff880036a17101 ffff88004bdf5a08 ffffffffa0089026
> Aug 15 20:12:37 x-ivb9 kernel: [ 8081.751823] Call Trace:
> Aug 15 20:12:37 x-ivb9 kernel: [ 8081.751843]  [<ffffffffa0088e55>] i915_vma_unbind+0xdf/0x1ab [i915]
> Aug 15 20:12:37 x-ivb9 kernel: [ 8081.751875]  [<ffffffffa0089026>] __i915_gem_shrink+0x105/0x177 [i915]
> Aug 15 20:12:37 x-ivb9 kernel: [ 8081.751909]  [<ffffffffa0089452>] i915_gem_object_get_pages_gtt+0x108/0x309 [i915]
> Aug 15 20:12:37 x-ivb9 kernel: [ 8081.751947]  [<ffffffffa0085ba9>] i915_gem_object_get_pages+0x61/0x90 [i915]
> Aug 15 20:12:37 x-ivb9 kernel: [ 8081.751983]  [<ffffffffa008f22b>] ? gen6_ppgtt_insert_entries+0x103/0x125 [i915]
> Aug 15 20:12:37 x-ivb9 kernel: [ 8081.752019]  [<ffffffffa008a113>] i915_gem_object_pin+0x1fa/0x5df [i915]
> Aug 15 20:12:37 x-ivb9 kernel: [ 8081.752054]  [<ffffffffa008cdfe>] i915_gem_execbuffer_reserve_object.isra.6+0x8d/0x1bc [i915]
> Aug 15 20:12:37 x-ivb9 kernel: [ 8081.752094]  [<ffffffffa008d156>] i915_gem_execbuffer_reserve+0x229/0x367 [i915]
> Aug 15 20:12:37 x-ivb9 kernel: [ 8081.752131]  [<ffffffffa008dbf6>] i915_gem_do_execbuffer.isra.12+0x4dc/0xf3a [i915]
> Aug 15 20:12:37 x-ivb9 kernel: [ 8081.752166]  [<ffffffff810fc823>] ? might_fault+0x40/0x90
> Aug 15 20:12:37 x-ivb9 kernel: [ 8081.752195]  [<ffffffffa008eb89>] i915_gem_execbuffer2+0x187/0x222 [i915]
> Aug 15 20:12:37 x-ivb9 kernel: [ 8081.752231]  [<ffffffffa000971c>] drm_ioctl+0x308/0x442 [drm]
> Aug 15 20:12:37 x-ivb9 kernel: [ 8081.752261]  [<ffffffffa008ea02>] ? i915_gem_execbuffer+0x3ae/0x3ae [i915]
> Aug 15 20:12:37 x-ivb9 kernel: [ 8081.752292]  [<ffffffff817db156>] ? __do_page_fault+0x3dd/0x481
> Aug 15 20:12:37 x-ivb9 kernel: [ 8081.752318]  [<ffffffff8112fdba>] vfs_ioctl+0x26/0x39
> Aug 15 20:12:37 x-ivb9 kernel: [ 8081.752341]  [<ffffffff811306a2>] do_vfs_ioctl+0x40e/0x451
> Aug 15 20:12:37 x-ivb9 kernel: [ 8081.752366]  [<ffffffff817deda7>] ? sysret_check+0x1b/0x56
> Aug 15 20:12:37 x-ivb9 kernel: [ 8081.752391]  [<ffffffff8113073c>] SyS_ioctl+0x57/0x87
> Aug 15 20:12:37 x-ivb9 kernel: [ 8081.752415]  [<ffffffff8135bbfe>] ? trace_hardirqs_on_thunk+0x3a/0x3f
> Aug 15 20:12:37 x-ivb9 kernel: [ 8081.752443]  [<ffffffff817ded82>] system_call_fastpath+0x16/0x1b
> Aug 15 20:12:37 x-ivb9 kernel: [ 8081.752469] Code: 48 c7 c6 84 30 0e a0 31 c0 e8 d0 e9 f7 ff bf c6 a7 00 00 e8 07 af 2c e1 41 f6 84 24 03 01 00 00 10 75 44 49 8b 84 24 08 01 00 00 <8b> 50 08 48 8b 30 49 8b 86 b0 04 00 00 48 89 c7 48 81 c7 98 00
> Aug 15 20:12:37 x-ivb9 kernel: [ 8081.752624] RIP  [<ffffffffa008fb37>] i915_gem_gtt_finish_object+0x73/0xc8 [i915]
> Aug 15 20:12:37 x-ivb9 kernel: [ 8081.753750]  RSP <ffff88004bdf5958>
> Aug 15 20:12:37 x-ivb9 kernel: [ 8081.754851] CR2: 0000000000000008
> 
> The second thing to correct is the "only one vma for now" check in
> vma_unbind - since vma_destroy isn't always called the obj->vma_list
> might not be empty. Instead check that the vma list is singular at the
> beginning of vma_unbind. This is also more symmetric with bind_to_vm.

It's not a "correction" exactly. It's to address the fact that the patch
makes the destroy conditional. The old assertion was correct.

> 
> Our first attempting at fixing this was
> 
> commit 1be81a2f2cfd8789a627401d470423358fba2d76
> Author: Chris Wilson <chris@chris-wilson.co.uk>
> Date:   Tue Aug 20 12:56:40 2013 +0100
> 
>     drm/i915: Don't destroy the vma placeholder during execbuffer reservation
> 
> Cc: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Ben Widawsky <ben@bwidawsk.net>
> Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=68298
> Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=68171
> Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>

It's not clear from the commit message if this actually fixes the bug
for you (which you seemed to be able to reproduce). Did it?

> ---
>  drivers/gpu/drm/i915/i915_gem.c | 15 +++++++--------
>  1 file changed, 7 insertions(+), 8 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> index ef92c69..3e855ff 100644
> --- a/drivers/gpu/drm/i915/i915_gem.c
> +++ b/drivers/gpu/drm/i915/i915_gem.c
> @@ -2616,6 +2616,9 @@ int i915_vma_unbind(struct i915_vma *vma)
>  	drm_i915_private_t *dev_priv = obj->base.dev->dev_private;
>  	int ret;
>  
> +	/* For now we only ever use 1 vma per object */
> +	WARN_ON(!list_is_singular(&obj->vma_list));
> +
>  	if (list_empty(&vma->vma_link))
>  		return 0;
>  
> @@ -2661,12 +2664,12 @@ int i915_vma_unbind(struct i915_vma *vma)
>  	drm_mm_remove_node(&vma->node);
>  
>  destroy:
> -	i915_gem_vma_destroy(vma);
> +	/* Keep the vma as a placeholder in the execbuffer reservation lists */
> +	if (!list_empty(&vma->exec_list))
> +		i915_gem_vma_destroy(vma);

Please see my vma->busy question at the top. To me this looks like the
behavior I had with vma->busy where you don't unlink from the vma_list
(which actually defies the original convention I had about the
vma_list, whereby a vma exists on a vma list IFF it has space allocated
in some address space). I think this is necessary though, as there are
really three states:
1. Bound
2. Temporarily unbound <-- this is the problematic one
3. Unbound

>  
>  	/* Since the unbound list is global, only move to that list if
> -	 * no more VMAs exist.
> -	 * NB: Until we have real VMAs there will only ever be one */
> -	WARN_ON(!list_empty(&obj->vma_list));
> +	 * no more VMAs exist. */
>  	if (list_empty(&obj->vma_list))
>  		list_move_tail(&obj->global_list, &dev_priv->mm.unbound_list);
>  
> @@ -4171,10 +4174,6 @@ void i915_gem_vma_destroy(struct i915_vma *vma)
>  	WARN_ON(vma->node.allocated);
>  	list_del(&vma->vma_link);
>  
> -	/* Keep the vma as a placeholder in the execbuffer reservation lists */
> -	if (!list_empty(&vma->exec_list))
> -		return;
> -
>  	kfree(vma);
>  }

If you really wan to remove this hunk, I think you should do it as a
revert (or remove it from the history completely if it isn't too late).
Instead though I think turning this into a WARN is still a valid thing
to do. Unless I've missed something. It shouldn't ever be safe to
destroy a VMA on an exec_list because we use the same list element in
eviction and if we accidentally destroy in eviction (called from
execbuf) we're in pretty big trouble.


Finally, I still think Chris' original suggestion of creating a new
member for eviction would make a lot of these bugs at least a little bit
easier to find. It's always hard to say though until you actually do it.
Daniel Vetter Aug. 22, 2013, 7:41 p.m. UTC | #2
On Thu, Aug 22, 2013 at 6:19 PM, Ben Widawsky <ben@bwidawsk.net> wrote:
> On Thu, Aug 22, 2013 at 11:55:21AM +0200, Daniel Vetter wrote:
>> The important bugfix here is that we must not unlink the vma when
>> we keep it around as a placeholder for the execbuf code. Since then we
>> won't find it again when execbuf gets interrupt and restarted and
>> create a 2nd vma. And since the code as-is isn't fit yet to deal with
>> more than one vma, hilarity ensues.
>
> This was the primary distinction between my vma->busy patch, and the
> patch you actually merged from Chris. AFAICT, you've adopted my
> vma->busy behavior which I remember being called out as a bug by Chris
> (and being convinced he was right).
>
> So can you explain for my sanity how this differs in behavior from
> vma->busy?

It should now match in behaviour with your busy patches I think. I
still believe that the more explicit vma->exec_list check is clearer
and introduces less redundancy.

>> Specifically the dma map/unmap of the sg table isn't adjusted for
>> multiple vmas yet and will blow up like this:
>>
>> Aug 15 20:12:37 x-ivb9 kernel: [ 8081.750976] BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
>> Aug 15 20:12:37 x-ivb9 kernel: [ 8081.751021] IP: [<ffffffffa008fb37>] i915_gem_gtt_finish_object+0x73/0xc8 [i915]
>> Aug 15 20:12:37 x-ivb9 kernel: [ 8081.751073] PGD 56bb5067 PUD ad3dd067 PMD 0
>> Aug 15 20:12:37 x-ivb9 kernel: [ 8081.751096] Oops: 0000 [#1] SMP
>> Aug 15 20:12:37 x-ivb9 kernel: [ 8081.751114] Modules linked in: tcp_lp ppdev parport_pc lp parport ipv6 dm_mod dcdbas snd_hda_codec_hdmi pcspkr snd_hda_codec_realtek serio_raw i2c_i801 iTCO_wdt iTCO_vendor_support snd_hda_intel snd_hda_codec lpc_ich snd_hwdep mfd_core snd_pcm snd_page_alloc snd_timer snd soundcore acpi_cpufreq i915 video button drm_kms_helper drm mperf freq_table
>> Aug 15 20:12:37 x-ivb9 kernel: [ 8081.751291] CPU: 1 PID: 16650 Comm: fbo-maxsize Not tainted 3.11.0-rc4_nightlytop_d93f59_debug_20130814_+ #6957
>> Aug 15 20:12:37 x-ivb9 kernel: [ 8081.751334] Hardware name: Dell Inc. OptiPlex 9010/03JR84, BIOS A01 05/04/2012
>> Aug 15 20:12:37 x-ivb9 kernel: [ 8081.751365] task: ffff8800563b3f00 ti: ffff88004bdf4000 task.ti: ffff88004bdf4000
>> Aug 15 20:12:37 x-ivb9 kernel: [ 8081.751396] RIP: 0010:[<ffffffffa008fb37>]  [<ffffffffa008fb37>] i915_gem_gtt_finish_object+0x73/0xc8 [i915]
>> Aug 15 20:12:37 x-ivb9 kernel: [ 8081.751446] RSP: 0018:ffff88004bdf5958  EFLAGS: 00010246
>> Aug 15 20:12:37 x-ivb9 kernel: [ 8081.751469] RAX: 0000000000000000 RBX: ffff8801135e0000 RCX: ffff8800ad3bf8e0
>> Aug 15 20:12:37 x-ivb9 kernel: [ 8081.751499] RDX: ffff8800ad3bf8e0 RSI: 0000000000000000 RDI: ffff8801007ee780
>> Aug 15 20:12:37 x-ivb9 kernel: [ 8081.751529] RBP: ffff88004bdf5978 R08: ffff8800ad3bf8e0 R09: 0000000000000000
>> Aug 15 20:12:37 x-ivb9 kernel: [ 8081.751559] R10: ffffffff86ca1810 R11: ffff880036a17101 R12: ffff8801007ee780
>> Aug 15 20:12:37 x-ivb9 kernel: [ 8081.751589] R13: 0000000000018001 R14: ffff880118c4e000 R15: ffff8801007ee780
>> Aug 15 20:12:37 x-ivb9 kernel: [ 8081.751619] FS:  00007f401a0ce740(0000) GS:ffff88011e280000(0000) knlGS:0000000000000000
>> Aug 15 20:12:37 x-ivb9 kernel: [ 8081.751653] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> Aug 15 20:12:37 x-ivb9 kernel: [ 8081.751677] CR2: 0000000000000008 CR3: 000000005635c000 CR4: 00000000001407e0
>> Aug 15 20:12:37 x-ivb9 kernel: [ 8081.751707] Stack:
>> Aug 15 20:12:37 x-ivb9 kernel: [ 8081.751717]  ffff8801007ee780 ffff88005c253180 0000000000018000 ffff8801135e0000
>> Aug 15 20:12:37 x-ivb9 kernel: [ 8081.751752]  ffff88004bdf59a8 ffffffffa0088e55 0000000000000011 ffff8801007eec00
>> Aug 15 20:12:37 x-ivb9 kernel: [ 8081.751788]  0000000000018000 ffff880036a17101 ffff88004bdf5a08 ffffffffa0089026
>> Aug 15 20:12:37 x-ivb9 kernel: [ 8081.751823] Call Trace:
>> Aug 15 20:12:37 x-ivb9 kernel: [ 8081.751843]  [<ffffffffa0088e55>] i915_vma_unbind+0xdf/0x1ab [i915]
>> Aug 15 20:12:37 x-ivb9 kernel: [ 8081.751875]  [<ffffffffa0089026>] __i915_gem_shrink+0x105/0x177 [i915]
>> Aug 15 20:12:37 x-ivb9 kernel: [ 8081.751909]  [<ffffffffa0089452>] i915_gem_object_get_pages_gtt+0x108/0x309 [i915]
>> Aug 15 20:12:37 x-ivb9 kernel: [ 8081.751947]  [<ffffffffa0085ba9>] i915_gem_object_get_pages+0x61/0x90 [i915]
>> Aug 15 20:12:37 x-ivb9 kernel: [ 8081.751983]  [<ffffffffa008f22b>] ? gen6_ppgtt_insert_entries+0x103/0x125 [i915]
>> Aug 15 20:12:37 x-ivb9 kernel: [ 8081.752019]  [<ffffffffa008a113>] i915_gem_object_pin+0x1fa/0x5df [i915]
>> Aug 15 20:12:37 x-ivb9 kernel: [ 8081.752054]  [<ffffffffa008cdfe>] i915_gem_execbuffer_reserve_object.isra.6+0x8d/0x1bc [i915]
>> Aug 15 20:12:37 x-ivb9 kernel: [ 8081.752094]  [<ffffffffa008d156>] i915_gem_execbuffer_reserve+0x229/0x367 [i915]
>> Aug 15 20:12:37 x-ivb9 kernel: [ 8081.752131]  [<ffffffffa008dbf6>] i915_gem_do_execbuffer.isra.12+0x4dc/0xf3a [i915]
>> Aug 15 20:12:37 x-ivb9 kernel: [ 8081.752166]  [<ffffffff810fc823>] ? might_fault+0x40/0x90
>> Aug 15 20:12:37 x-ivb9 kernel: [ 8081.752195]  [<ffffffffa008eb89>] i915_gem_execbuffer2+0x187/0x222 [i915]
>> Aug 15 20:12:37 x-ivb9 kernel: [ 8081.752231]  [<ffffffffa000971c>] drm_ioctl+0x308/0x442 [drm]
>> Aug 15 20:12:37 x-ivb9 kernel: [ 8081.752261]  [<ffffffffa008ea02>] ? i915_gem_execbuffer+0x3ae/0x3ae [i915]
>> Aug 15 20:12:37 x-ivb9 kernel: [ 8081.752292]  [<ffffffff817db156>] ? __do_page_fault+0x3dd/0x481
>> Aug 15 20:12:37 x-ivb9 kernel: [ 8081.752318]  [<ffffffff8112fdba>] vfs_ioctl+0x26/0x39
>> Aug 15 20:12:37 x-ivb9 kernel: [ 8081.752341]  [<ffffffff811306a2>] do_vfs_ioctl+0x40e/0x451
>> Aug 15 20:12:37 x-ivb9 kernel: [ 8081.752366]  [<ffffffff817deda7>] ? sysret_check+0x1b/0x56
>> Aug 15 20:12:37 x-ivb9 kernel: [ 8081.752391]  [<ffffffff8113073c>] SyS_ioctl+0x57/0x87
>> Aug 15 20:12:37 x-ivb9 kernel: [ 8081.752415]  [<ffffffff8135bbfe>] ? trace_hardirqs_on_thunk+0x3a/0x3f
>> Aug 15 20:12:37 x-ivb9 kernel: [ 8081.752443]  [<ffffffff817ded82>] system_call_fastpath+0x16/0x1b
>> Aug 15 20:12:37 x-ivb9 kernel: [ 8081.752469] Code: 48 c7 c6 84 30 0e a0 31 c0 e8 d0 e9 f7 ff bf c6 a7 00 00 e8 07 af 2c e1 41 f6 84 24 03 01 00 00 10 75 44 49 8b 84 24 08 01 00 00 <8b> 50 08 48 8b 30 49 8b 86 b0 04 00 00 48 89 c7 48 81 c7 98 00
>> Aug 15 20:12:37 x-ivb9 kernel: [ 8081.752624] RIP  [<ffffffffa008fb37>] i915_gem_gtt_finish_object+0x73/0xc8 [i915]
>> Aug 15 20:12:37 x-ivb9 kernel: [ 8081.753750]  RSP <ffff88004bdf5958>
>> Aug 15 20:12:37 x-ivb9 kernel: [ 8081.754851] CR2: 0000000000000008
>>
>> The second thing to correct is the "only one vma for now" check in
>> vma_unbind - since vma_destroy isn't always called the obj->vma_list
>> might not be empty. Instead check that the vma list is singular at the
>> beginning of vma_unbind. This is also more symmetric with bind_to_vm.
>
> It's not a "correction" exactly. It's to address the fact that the patch
> makes the destroy conditional. The old assertion was correct.

Yeah, "As a consequence we need to ..." or so is better.

>> Our first attempting at fixing this was
>>
>> commit 1be81a2f2cfd8789a627401d470423358fba2d76
>> Author: Chris Wilson <chris@chris-wilson.co.uk>
>> Date:   Tue Aug 20 12:56:40 2013 +0100
>>
>>     drm/i915: Don't destroy the vma placeholder during execbuffer reservation
>>
>> Cc: Chris Wilson <chris@chris-wilson.co.uk>
>> Cc: Ben Widawsky <ben@bwidawsk.net>
>> Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=68298
>> Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=68171
>> Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
>
> It's not clear from the commit message if this actually fixes the bug
> for you (which you seemed to be able to reproduce). Did it?

Nope, just hard looking at the Oopses, haven't yet reproduced the bug
here. But now that I have a theory what's busted that should help a
bit with writing igts ...

>> ---
>>  drivers/gpu/drm/i915/i915_gem.c | 15 +++++++--------
>>  1 file changed, 7 insertions(+), 8 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
>> index ef92c69..3e855ff 100644
>> --- a/drivers/gpu/drm/i915/i915_gem.c
>> +++ b/drivers/gpu/drm/i915/i915_gem.c
>> @@ -2616,6 +2616,9 @@ int i915_vma_unbind(struct i915_vma *vma)
>>       drm_i915_private_t *dev_priv = obj->base.dev->dev_private;
>>       int ret;
>>
>> +     /* For now we only ever use 1 vma per object */
>> +     WARN_ON(!list_is_singular(&obj->vma_list));
>> +
>>       if (list_empty(&vma->vma_link))
>>               return 0;
>>
>> @@ -2661,12 +2664,12 @@ int i915_vma_unbind(struct i915_vma *vma)
>>       drm_mm_remove_node(&vma->node);
>>
>>  destroy:
>> -     i915_gem_vma_destroy(vma);
>> +     /* Keep the vma as a placeholder in the execbuffer reservation lists */
>> +     if (!list_empty(&vma->exec_list))
>> +             i915_gem_vma_destroy(vma);
>
> Please see my vma->busy question at the top. To me this looks like the
> behavior I had with vma->busy where you don't unlink from the vma_list
> (which actually defies the original convention I had about the
> vma_list, whereby a vma exists on a vma list IFF it has space allocated
> in some address space). I think this is necessary though, as there are
> really three states:
> 1. Bound
> 2. Temporarily unbound <-- this is the problematic one
> 3. Unbound

Yes.

>
>>
>>       /* Since the unbound list is global, only move to that list if
>> -      * no more VMAs exist.
>> -      * NB: Until we have real VMAs there will only ever be one */
>> -     WARN_ON(!list_empty(&obj->vma_list));
>> +      * no more VMAs exist. */
>>       if (list_empty(&obj->vma_list))
>>               list_move_tail(&obj->global_list, &dev_priv->mm.unbound_list);
>>
>> @@ -4171,10 +4174,6 @@ void i915_gem_vma_destroy(struct i915_vma *vma)
>>       WARN_ON(vma->node.allocated);
>>       list_del(&vma->vma_link);
>>
>> -     /* Keep the vma as a placeholder in the execbuffer reservation lists */
>> -     if (!list_empty(&vma->exec_list))
>> -             return;
>> -
>>       kfree(vma);
>>  }
>
> If you really wan to remove this hunk, I think you should do it as a
> revert (or remove it from the history completely if it isn't too late).
> Instead though I think turning this into a WARN is still a valid thing
> to do. Unless I've missed something. It shouldn't ever be safe to
> destroy a VMA on an exec_list because we use the same list element in
> eviction and if we accidentally destroy in eviction (called from
> execbuf) we're in pretty big trouble.

I think I'll squash this in when merging with Chris' patch and fixup
the commit message a bit. For more paranoia in the evict code Chris
suggested to add a check in mark_free for list_empty(vma->exec_list).
Would be good in a follow-up patch imo.

> Finally, I still think Chris' original suggestion of creating a new
> member for eviction would make a lot of these bugs at least a little bit
> easier to find. It's always hard to say though until you actually do it.

Yeah, we could just go back to the old ways of using the lru lists to
unwind the eviction roaster (i.e. active/inactive vm lists now).
That's also why I think we could just merge the active/inactive list
together into one, since without that merge it'll be trouble. But when
Chris merged the fair lru eviction code he frobbed it a bit to use the
current temporary list.

Adding yet another list is imo wasteful ;-)

Cheers, Daniel
Ben Widawsky Aug. 22, 2013, 8:37 p.m. UTC | #3
On Thu, Aug 22, 2013 at 09:41:07PM +0200, Daniel Vetter wrote:

[snip]

> >
> > It's not clear from the commit message if this actually fixes the bug
> > for you (which you seemed to be able to reproduce). Did it?
> 
> Nope, just hard looking at the Oopses, haven't yet reproduced the bug
> here. But now that I have a theory what's busted that should help a
> bit with writing igts ...
> 

In that case, I think you need the same check in the failure path in
bind_to_vm. I have that fixed locally, but haven't actually tested it.

> >> ---
> >>  drivers/gpu/drm/i915/i915_gem.c | 15 +++++++--------
> >>  1 file changed, 7 insertions(+), 8 deletions(-)
> >>
> >> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> >> index ef92c69..3e855ff 100644
> >> --- a/drivers/gpu/drm/i915/i915_gem.c
> >> +++ b/drivers/gpu/drm/i915/i915_gem.c
> >> @@ -2616,6 +2616,9 @@ int i915_vma_unbind(struct i915_vma *vma)
> >>       drm_i915_private_t *dev_priv = obj->base.dev->dev_private;
> >>       int ret;
> >>
> >> +     /* For now we only ever use 1 vma per object */
> >> +     WARN_ON(!list_is_singular(&obj->vma_list));
> >> +
> >>       if (list_empty(&vma->vma_link))
> >>               return 0;
> >>
> >> @@ -2661,12 +2664,12 @@ int i915_vma_unbind(struct i915_vma *vma)
> >>       drm_mm_remove_node(&vma->node);
> >>
> >>  destroy:
> >> -     i915_gem_vma_destroy(vma);
> >> +     /* Keep the vma as a placeholder in the execbuffer reservation lists */
> >> +     if (!list_empty(&vma->exec_list))
> >> +             i915_gem_vma_destroy(vma);
> >
> > Please see my vma->busy question at the top. To me this looks like the
> > behavior I had with vma->busy where you don't unlink from the vma_list
> > (which actually defies the original convention I had about the
> > vma_list, whereby a vma exists on a vma list IFF it has space allocated
> > in some address space). I think this is necessary though, as there are
> > really three states:
> > 1. Bound
> > 2. Temporarily unbound <-- this is the problematic one
> > 3. Unbound
> 
> Yes.
> 
> >
> >>
> >>       /* Since the unbound list is global, only move to that list if
> >> -      * no more VMAs exist.
> >> -      * NB: Until we have real VMAs there will only ever be one */
> >> -     WARN_ON(!list_empty(&obj->vma_list));
> >> +      * no more VMAs exist. */
> >>       if (list_empty(&obj->vma_list))
> >>               list_move_tail(&obj->global_list, &dev_priv->mm.unbound_list);
> >>
> >> @@ -4171,10 +4174,6 @@ void i915_gem_vma_destroy(struct i915_vma *vma)
> >>       WARN_ON(vma->node.allocated);
> >>       list_del(&vma->vma_link);
> >>
> >> -     /* Keep the vma as a placeholder in the execbuffer reservation lists */
> >> -     if (!list_empty(&vma->exec_list))
> >> -             return;
> >> -
> >>       kfree(vma);
> >>  }
> >
> > If you really wan to remove this hunk, I think you should do it as a
> > revert (or remove it from the history completely if it isn't too late).
> > Instead though I think turning this into a WARN is still a valid thing
> > to do. Unless I've missed something. It shouldn't ever be safe to
> > destroy a VMA on an exec_list because we use the same list element in
> > eviction and if we accidentally destroy in eviction (called from
> > execbuf) we're in pretty big trouble.
> 
> I think I'll squash this in when merging with Chris' patch and fixup
> the commit message a bit. For more paranoia in the evict code Chris
> suggested to add a check in mark_free for list_empty(vma->exec_list).
> Would be good in a follow-up patch imo.

I have this exact same check in my current code to debug this, I think
it makes a lot of sense.

> 
> > Finally, I still think Chris' original suggestion of creating a new
> > member for eviction would make a lot of these bugs at least a little bit
> > easier to find. It's always hard to say though until you actually do it.
> 
> Yeah, we could just go back to the old ways of using the lru lists to
> unwind the eviction roaster (i.e. active/inactive vm lists now).
> That's also why I think we could just merge the active/inactive list
> together into one, since without that merge it'll be trouble. But when
> Chris merged the fair lru eviction code he frobbed it a bit to use the
> current temporary list.
> 
> Adding yet another list is imo wasteful ;-)

I guess we'll see how difficult the bugs are to fix. It uses extra
memory, that much is indisputable.
diff mbox

Patch

diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index ef92c69..3e855ff 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -2616,6 +2616,9 @@  int i915_vma_unbind(struct i915_vma *vma)
 	drm_i915_private_t *dev_priv = obj->base.dev->dev_private;
 	int ret;
 
+	/* For now we only ever use 1 vma per object */
+	WARN_ON(!list_is_singular(&obj->vma_list));
+
 	if (list_empty(&vma->vma_link))
 		return 0;
 
@@ -2661,12 +2664,12 @@  int i915_vma_unbind(struct i915_vma *vma)
 	drm_mm_remove_node(&vma->node);
 
 destroy:
-	i915_gem_vma_destroy(vma);
+	/* Keep the vma as a placeholder in the execbuffer reservation lists */
+	if (!list_empty(&vma->exec_list))
+		i915_gem_vma_destroy(vma);
 
 	/* Since the unbound list is global, only move to that list if
-	 * no more VMAs exist.
-	 * NB: Until we have real VMAs there will only ever be one */
-	WARN_ON(!list_empty(&obj->vma_list));
+	 * no more VMAs exist. */
 	if (list_empty(&obj->vma_list))
 		list_move_tail(&obj->global_list, &dev_priv->mm.unbound_list);
 
@@ -4171,10 +4174,6 @@  void i915_gem_vma_destroy(struct i915_vma *vma)
 	WARN_ON(vma->node.allocated);
 	list_del(&vma->vma_link);
 
-	/* Keep the vma as a placeholder in the execbuffer reservation lists */
-	if (!list_empty(&vma->exec_list))
-		return;
-
 	kfree(vma);
 }