
[RFC] drm/nouveau: fix nested locking in mmap handler

Message ID 52405F3E.4000609@canonical.com (mailing list archive)
State New, archived

Commit Message

Maarten Lankhorst Sept. 23, 2013, 3:33 p.m. UTC
Hey,

On 13-09-13 11:00, Peter Zijlstra wrote:
> On Fri, Sep 13, 2013 at 10:41:54AM +0200, Daniel Vetter wrote:
>> On Fri, Sep 13, 2013 at 10:29 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>>> On Fri, Sep 13, 2013 at 09:46:03AM +0200, Thomas Hellstrom wrote:
>>>>>> if (!bo_tryreserve()) {
>>>>>>     up_read mmap_sem(); // Release the mmap_sem to avoid deadlocks.
>>>>>>     bo_reserve();               // Wait for the BO to become available (interruptible)
>>>>>>     bo_unreserve();           // Where is bo_wait_unreserved() when we need it, Maarten :P
>>>>>>     return VM_FAULT_RETRY; // Go ahead and retry the VMA walk, after regrabbing
>>>>>> }
>>>> Anyway, could you describe what is wrong, with the above solution, because
>>>> it seems perfectly legal to me.
>>> Luckily the rule of law doesn't have anything to do with this stuff --
>>> at least I sincerely hope so.
>>>
>>> The thing that's wrong with that pattern is that its still not
>>> deterministic - although its a lot better than the pure trylock. Because
>>> you have to release and re-acquire with the trylock another user might
>>> have gotten in again. Its utterly prone to starvation.
>>>
>>> The acquire+release does remove the dead/life-lock scenario from the
>>> FIFO case, since blocking on the acquire will allow the other task to
>>> run (or even get boosted on -rt).
>>>
>>> Aside from that there's nothing particularly wrong with it and lockdep
>>> should be happy afaict (but I haven't had my morning juice yet).
>> bo_reserve internally maps to a ww-mutex and task can already hold
>> ww-mutex (potentially even the same for especially nasty userspace).
> OK, yes I wasn't aware of that. Yes in that case you're quite right.
>
I added an RFC patch below. I only tested with PROVE_LOCKING, and always forced the slowpath for debugging.

This fixes nouveau and core ttm to always use blocking acquisition in the fastpath.
Nouveau was a bit of a headache, but afaict it should work.

In almost all cases relocs are not updated, so I kept intact the fastpath
of not copying relocs from userspace. The slowpath tries to copy them atomically,
and if that fails it will unreserve all bo's and copy everything.
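
In pseudo-code, the slowpath pattern is roughly the following (a simplified
sketch of the nouveau_gem_pushbuf_validate() changes below; the helper names
are placeholders, not the real functions):

restart:
	reserve_all_buffers();			/* blocking ww-mutex acquisition */
	if (relocs_needed && !reloc) {
		/* No faulting allowed while the reservations are held. */
		reloc = copy_relocs_inatomic();
		if (IS_ERR(reloc)) {
			unreserve_all_buffers();
			/* Blocking copy; may fault and sleep. */
			reloc = copy_relocs_blocking();
			if (IS_ERR(reloc))
				return PTR_ERR(reloc);
			goto restart;
		}
		apply_relocs(reloc);
	}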

One thing to note is that the command submission ioctl may now fail with -EFAULT
if presumed cannot be updated, while the commands are submitted successfully.

I'm not sure what the right behavior is here, and this can only happen if you
touch the memory during the ioctl or use a read-only page. Neither is done
in the common case.

Reviews welcome. :P

8<---

Comments

Thomas Hellstrom Sept. 24, 2013, 7:22 a.m. UTC | #1
On 09/23/2013 05:33 PM, Maarten Lankhorst wrote:
> Hey,
>
> On 13-09-13 11:00, Peter Zijlstra wrote:
>> On Fri, Sep 13, 2013 at 10:41:54AM +0200, Daniel Vetter wrote:
>>> On Fri, Sep 13, 2013 at 10:29 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>>>> On Fri, Sep 13, 2013 at 09:46:03AM +0200, Thomas Hellstrom wrote:
>>>>>>> if (!bo_tryreserve()) {
>>>>>>>      up_read mmap_sem(); // Release the mmap_sem to avoid deadlocks.
>>>>>>>      bo_reserve();               // Wait for the BO to become available (interruptible)
>>>>>>>      bo_unreserve();           // Where is bo_wait_unreserved() when we need it, Maarten :P
>>>>>>>      return VM_FAULT_RETRY; // Go ahead and retry the VMA walk, after regrabbing
>>>>>>> }
>>>>> Anyway, could you describe what is wrong, with the above solution, because
>>>>> it seems perfectly legal to me.
>>>> Luckily the rule of law doesn't have anything to do with this stuff --
>>>> at least I sincerely hope so.
>>>>
>>>> The thing that's wrong with that pattern is that its still not
>>>> deterministic - although its a lot better than the pure trylock. Because
>>>> you have to release and re-acquire with the trylock another user might
>>>> have gotten in again. Its utterly prone to starvation.
>>>>
>>>> The acquire+release does remove the dead/life-lock scenario from the
>>>> FIFO case, since blocking on the acquire will allow the other task to
>>>> run (or even get boosted on -rt).
>>>>
>>>> Aside from that there's nothing particularly wrong with it and lockdep
>>>> should be happy afaict (but I haven't had my morning juice yet).
>>> bo_reserve internally maps to a ww-mutex and task can already hold
>>> ww-mutex (potentially even the same for especially nasty userspace).
>> OK, yes I wasn't aware of that. Yes in that case you're quite right.
>>
> I added a RFC patch below.  I only tested with PROVE_LOCKING, and always forced the slowpath for debugging.
>
> This fixes nouveau and core ttm to always use blocking acquisition in fastpath.
> Nouveau was a bit of a headache, but afaict it should work.
>
> In almost all cases relocs are not updated, so I kept intact the fastpath
> of not copying relocs from userspace. The slowpath tries to copy it atomically,
> and if that fails it will unreserve all bo's and copy everything.
>
> One thing to note is that the command submission ioctl may fail now with -EFAULT
> if presumed cannot be updated, while the commands are submitted successfully.

I think the Nouveau guys need to comment further on this, but returning 
-EFAULT might break existing user-space, and that's not allowed. But 
IIRC the returned value of "presumed" is only a hint, and if it's 
incorrect it will only trigger future command stream patching.

Otherwise reviewing mostly the TTM stuff. FWIW, from what I can tell the 
vmwgfx driver doesn't need any fixups.
>
> I'm not sure what the right behavior was here, and this can only happen if you
> touch the memory during the ioctl or use a read-only page. Both of them are not done
> in the common case.
>
> Reviews welcome. :P
>
> 8<---
>
> diff --git a/drivers/gpu/drm/nouveau/nouveau_gem.c b/drivers/gpu/drm/nouveau/nouveau_gem.c
> index e4d60e7..2964bb7 100644
> --- a/drivers/gpu/drm/nouveau/nouveau_gem.c
> +++ b/drivers/gpu/drm/nouveau/nouveau_gem.c
> @@ -445,8 +445,6 @@ validate_list(struct nouveau_channel *chan, struct nouveau_cli *cli,
>   	      uint64_t user_pbbo_ptr)
>   {
>   	struct nouveau_drm *drm = chan->drm;
> -	struct drm_nouveau_gem_pushbuf_bo __user *upbbo =
> -				(void __force __user *)(uintptr_t)user_pbbo_ptr;
>   	struct nouveau_bo *nvbo;
>   	int ret, relocs = 0;
>   
> @@ -475,7 +473,7 @@ validate_list(struct nouveau_channel *chan, struct nouveau_cli *cli,
>   			return ret;
>   		}
>   
> -		if (nv_device(drm->device)->card_type < NV_50) {
> +		if (nv_device(drm->device)->card_type < NV_50 && !relocs) {
>   			if (nvbo->bo.offset == b->presumed.offset &&
>   			    ((nvbo->bo.mem.mem_type == TTM_PL_VRAM &&
>   			      b->presumed.domain & NOUVEAU_GEM_DOMAIN_VRAM) ||
> @@ -483,53 +481,86 @@ validate_list(struct nouveau_channel *chan, struct nouveau_cli *cli,
>   			      b->presumed.domain & NOUVEAU_GEM_DOMAIN_GART)))
>   				continue;
>   
> -			if (nvbo->bo.mem.mem_type == TTM_PL_TT)
> -				b->presumed.domain = NOUVEAU_GEM_DOMAIN_GART;
> -			else
> -				b->presumed.domain = NOUVEAU_GEM_DOMAIN_VRAM;
> -			b->presumed.offset = nvbo->bo.offset;
> -			b->presumed.valid = 0;
> -			relocs++;
> -
> -			if (DRM_COPY_TO_USER(&upbbo[nvbo->pbbo_index].presumed,
> -					     &b->presumed, sizeof(b->presumed)))
> -				return -EFAULT;
> +			relocs = 1;
>   		}
>   	}
>   
>   	return relocs;
>   }
>   
> +static inline void
> +u_free(void *addr)
> +{
> +	if (!is_vmalloc_addr(addr))
> +		kfree(addr);
> +	else
> +		vfree(addr);
> +}

Isn't there a DRM utility for this?

> +
> +static inline void *
> +u_memcpya(uint64_t user, unsigned nmemb, unsigned size, unsigned inatomic)
> +{
> +	void *mem;
> +	void __user *userptr = (void __force __user *)(uintptr_t)user;
> +
> +	size *= nmemb;
> +
> +	mem = kmalloc(size, GFP_KERNEL | __GFP_NOWARN);
> +	if (!mem)
> +		mem = vmalloc(size);
And for the above as well?

> +	if (!mem)
> +		return ERR_PTR(-ENOMEM);
> +
> +	if (inatomic && (!access_ok(VERIFY_READ, userptr, size) ||
> +	    __copy_from_user_inatomic(mem, userptr, size))) {
> +		u_free(mem);
> +		return ERR_PTR(-EFAULT);
> +	} else if (!inatomic && copy_from_user(mem, userptr, size)) {
> +		u_free(mem);
> +		return ERR_PTR(-EFAULT);
> +	}
> +
> +	return mem;
> +}
> +
> +static int
> +nouveau_gem_pushbuf_reloc_apply(struct nouveau_cli *cli,
> +				struct drm_nouveau_gem_pushbuf *req,
> +				struct drm_nouveau_gem_pushbuf_bo *bo,
> +				struct drm_nouveau_gem_pushbuf_reloc *reloc);
> +
>   static int
>   nouveau_gem_pushbuf_validate(struct nouveau_channel *chan,
>   			     struct drm_file *file_priv,
>   			     struct drm_nouveau_gem_pushbuf_bo *pbbo,
> +			     struct drm_nouveau_gem_pushbuf *req,
>   			     uint64_t user_buffers, int nr_buffers,
> -			     struct validate_op *op, int *apply_relocs)
> +			     struct validate_op *op, int *do_reloc)
>   {
>   	struct nouveau_cli *cli = nouveau_cli(file_priv);
>   	int ret, relocs = 0;
> +	struct drm_nouveau_gem_pushbuf_reloc *reloc = NULL;
> +
> +	if (nr_buffers == 0)
> +		return 0;
>   
> +restart:
>   	INIT_LIST_HEAD(&op->vram_list);
>   	INIT_LIST_HEAD(&op->gart_list);
>   	INIT_LIST_HEAD(&op->both_list);
>   
> -	if (nr_buffers == 0)
> -		return 0;
> -
>   	ret = validate_init(chan, file_priv, pbbo, nr_buffers, op);
>   	if (unlikely(ret)) {
>   		if (ret != -ERESTARTSYS)
>   			NV_ERROR(cli, "validate_init\n");
> -		return ret;
> +		goto err;
>   	}
>   
>   	ret = validate_list(chan, cli, &op->vram_list, pbbo, user_buffers);
>   	if (unlikely(ret < 0)) {
>   		if (ret != -ERESTARTSYS)
>   			NV_ERROR(cli, "validate vram_list\n");
> -		validate_fini(op, NULL);
> -		return ret;
> +		goto err_fini;
>   	}
>   	relocs += ret;
>   
> @@ -537,8 +568,7 @@ nouveau_gem_pushbuf_validate(struct nouveau_channel *chan,
>   	if (unlikely(ret < 0)) {
>   		if (ret != -ERESTARTSYS)
>   			NV_ERROR(cli, "validate gart_list\n");
> -		validate_fini(op, NULL);
> -		return ret;
> +		goto err_fini;
>   	}
>   	relocs += ret;
>   
> @@ -546,58 +576,93 @@ nouveau_gem_pushbuf_validate(struct nouveau_channel *chan,
>   	if (unlikely(ret < 0)) {
>   		if (ret != -ERESTARTSYS)
>   			NV_ERROR(cli, "validate both_list\n");
> -		validate_fini(op, NULL);
> -		return ret;
> +		goto err_fini;
>   	}
>   	relocs += ret;
> +	if (relocs) {
> +		if (!reloc) {
> +			//reloc = u_memcpya(req->relocs, req->nr_relocs, sizeof(*reloc), 1);
> +			reloc = ERR_PTR(-EFAULT); NV_ERROR(cli, "slowpath!\n");
> +		}
> +		if (IS_ERR(reloc)) {
> +			validate_fini(op, NULL);
> +
> +			if (PTR_ERR(reloc) == -EFAULT)
> +				reloc = u_memcpya(req->relocs, req->nr_relocs, sizeof(*reloc), 0);
> +
> +			if (IS_ERR(reloc))
> +				return PTR_ERR(reloc);
> +			goto restart;
> +		}
> +
> +		ret = nouveau_gem_pushbuf_reloc_apply(cli, req, pbbo, reloc);
> +		if (ret) {
> +			NV_ERROR(cli, "reloc apply: %d\n", ret);
> +			/* No validate_fini, already called. */
> +			return ret;
> +		}
> +		u_free(reloc);
> +		*do_reloc = 1;
> +	}
>   
> -	*apply_relocs = relocs;
>   	return 0;
> -}
>   
> -static inline void
> -u_free(void *addr)
> -{
> -	if (!is_vmalloc_addr(addr))
> -		kfree(addr);
> -	else
> -		vfree(addr);
> +err_fini:
> +	validate_fini(op, NULL);
> +err:
> +	if (reloc)
> +		u_free(reloc);
> +	return ret;
>   }
>   
> -static inline void *
> -u_memcpya(uint64_t user, unsigned nmemb, unsigned size)
> +static int
> +nouveau_gem_pushbuf_reloc_copy_to_user(struct drm_nouveau_gem_pushbuf *req,
> +				       struct drm_nouveau_gem_pushbuf_bo *bo)
>   {
> -	void *mem;
> -	void __user *userptr = (void __force __user *)(uintptr_t)user;
> +	struct drm_nouveau_gem_pushbuf_bo __user *upbbo =
> +				 (void __force __user *)(uintptr_t)req->buffers;
> +	unsigned i;
>   
> -	size *= nmemb;
> +	for (i = 0; i < req->nr_buffers; ++i) {
> +		struct drm_nouveau_gem_pushbuf_bo *b = &bo[i];
>   
> -	mem = kmalloc(size, GFP_KERNEL | __GFP_NOWARN);
> -	if (!mem)
> -		mem = vmalloc(size);
> -	if (!mem)
> -		return ERR_PTR(-ENOMEM);
> -
> -	if (DRM_COPY_FROM_USER(mem, userptr, size)) {
> -		u_free(mem);
> -		return ERR_PTR(-EFAULT);
> +		if (!b->presumed.valid &&
> +		    copy_to_user(&upbbo[i].presumed, &b->presumed, sizeof(b->presumed)))
> +			return -EFAULT;
>   	}
> -
> -	return mem;
> +	return 0;
>   }
>   
>   static int
>   nouveau_gem_pushbuf_reloc_apply(struct nouveau_cli *cli,
>   				struct drm_nouveau_gem_pushbuf *req,
> -				struct drm_nouveau_gem_pushbuf_bo *bo)
> +				struct drm_nouveau_gem_pushbuf_bo *bo,
> +				struct drm_nouveau_gem_pushbuf_reloc *reloc)
>   {
> -	struct drm_nouveau_gem_pushbuf_reloc *reloc = NULL;
>   	int ret = 0;
>   	unsigned i;
>   
> -	reloc = u_memcpya(req->relocs, req->nr_relocs, sizeof(*reloc));
> -	if (IS_ERR(reloc))
> -		return PTR_ERR(reloc);
> +	for (i = 0; i < req->nr_buffers; ++i) {
> +		struct drm_nouveau_gem_pushbuf_bo *b = &bo[i];
> +		struct nouveau_bo *nvbo = (void *)(unsigned long)
> +			bo[i].user_priv;
> +
> +		if (nvbo->bo.offset == b->presumed.offset &&
> +		    ((nvbo->bo.mem.mem_type == TTM_PL_VRAM &&
> +		      b->presumed.domain & NOUVEAU_GEM_DOMAIN_VRAM) ||
> +		     (nvbo->bo.mem.mem_type == TTM_PL_TT &&
> +		      b->presumed.domain & NOUVEAU_GEM_DOMAIN_GART))) {
> +			b->presumed.valid = 1;
> +			continue;
> +		}
> +
> +		if (nvbo->bo.mem.mem_type == TTM_PL_TT)
> +			b->presumed.domain = NOUVEAU_GEM_DOMAIN_GART;
> +		else
> +			b->presumed.domain = NOUVEAU_GEM_DOMAIN_VRAM;
> +		b->presumed.offset = nvbo->bo.offset;
> +		b->presumed.valid = 0;
> +	}
>   
>   	for (i = 0; i < req->nr_relocs; i++) {
>   		struct drm_nouveau_gem_pushbuf_reloc *r = &reloc[i];
> @@ -664,8 +729,6 @@ nouveau_gem_pushbuf_reloc_apply(struct nouveau_cli *cli,
>   
>   		nouveau_bo_wr32(nvbo, r->reloc_bo_offset >> 2, data);
>   	}
> -
> -	u_free(reloc);
>   	return ret;
>   }
>   
> @@ -721,11 +784,11 @@ nouveau_gem_ioctl_pushbuf(struct drm_device *dev, void *data,
>   		return nouveau_abi16_put(abi16, -EINVAL);
>   	}
>   
> -	push = u_memcpya(req->push, req->nr_push, sizeof(*push));
> +	push = u_memcpya(req->push, req->nr_push, sizeof(*push), 0);
>   	if (IS_ERR(push))
>   		return nouveau_abi16_put(abi16, PTR_ERR(push));
>   
> -	bo = u_memcpya(req->buffers, req->nr_buffers, sizeof(*bo));
> +	bo = u_memcpya(req->buffers, req->nr_buffers, sizeof(*bo), 0);
>   	if (IS_ERR(bo)) {
>   		u_free(push);
>   		return nouveau_abi16_put(abi16, PTR_ERR(bo));
> @@ -741,7 +804,7 @@ nouveau_gem_ioctl_pushbuf(struct drm_device *dev, void *data,
>   	}
>   
>   	/* Validate buffer list */
> -	ret = nouveau_gem_pushbuf_validate(chan, file_priv, bo, req->buffers,
> +	ret = nouveau_gem_pushbuf_validate(chan, file_priv, bo, req, req->buffers,
>   					   req->nr_buffers, &op, &do_reloc);
>   	if (ret) {
>   		if (ret != -ERESTARTSYS)
> @@ -749,15 +812,6 @@ nouveau_gem_ioctl_pushbuf(struct drm_device *dev, void *data,
>   		goto out_prevalid;
>   	}
>   
> -	/* Apply any relocations that are required */
> -	if (do_reloc) {
> -		ret = nouveau_gem_pushbuf_reloc_apply(cli, req, bo);
> -		if (ret) {
> -			NV_ERROR(cli, "reloc apply: %d\n", ret);
> -			goto out;
> -		}
> -	}
> -
>   	if (chan->dma.ib_max) {
>   		ret = nouveau_dma_wait(chan, req->nr_push + 1, 16);
>   		if (ret) {
> @@ -837,6 +891,17 @@ out:
>   	validate_fini(&op, fence);
>   	nouveau_fence_unref(&fence);
>   
> +	if (do_reloc && !ret) {
> +		ret = nouveau_gem_pushbuf_reloc_copy_to_user(req, bo);
> +		if (ret) {
> +			NV_ERROR(cli, "error copying presumed back to userspace: %d\n", ret);
> +			/*
> +			 * XXX: We return -EFAULT, but command submission
> +			 * has already been completed.
> +			 */
> +		}
> +	}
> +
>   out_prevalid:
>   	u_free(bo);
>   	u_free(push);
> diff --git a/drivers/gpu/drm/ttm/ttm_bo_vm.c b/drivers/gpu/drm/ttm/ttm_bo_vm.c
> index 1006c15..829e911 100644
> --- a/drivers/gpu/drm/ttm/ttm_bo_vm.c
> +++ b/drivers/gpu/drm/ttm/ttm_bo_vm.c
> @@ -64,12 +64,9 @@ static int ttm_bo_vm_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
>   	 * for reserve, and if it fails, retry the fault after scheduling.
>   	 */
>   
> -	ret = ttm_bo_reserve(bo, true, true, false, 0);
> -	if (unlikely(ret != 0)) {
> -		if (ret == -EBUSY)
> -			set_need_resched();
> +	ret = ttm_bo_reserve(bo, true, false, false, 0);
> +	if (unlikely(ret != 0))
>   		return VM_FAULT_NOPAGE;
> -	}
>   

Actually, it's not the locking order bo:reserve -> mmap_sem that 
triggers the lockdep warning, right? But the fact that copy_from_user 
could recurse into the fault handler, grabbing bo:reserves recursively? 
Which should leave us free to choose the locking order at this point.

>   	if (bdev->driver->fault_reserve_notify) {
>   		ret = bdev->driver->fault_reserve_notify(bo);
> @@ -77,6 +74,7 @@ static int ttm_bo_vm_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
>   		case 0:
>   			break;
>   		case -EBUSY:
> +			WARN_ON(1);
>   			set_need_resched();

I don't think this is used anymore, so set_need_resched might go.

>   		case -ERESTARTSYS:
>   			retval = VM_FAULT_NOPAGE;
/Thomas
Maarten Lankhorst Sept. 24, 2013, 7:34 a.m. UTC | #2
On 24-09-13 09:22, Thomas Hellstrom wrote:
> On 09/23/2013 05:33 PM, Maarten Lankhorst wrote:
>> Hey,
>>
>> On 13-09-13 11:00, Peter Zijlstra wrote:
>>> On Fri, Sep 13, 2013 at 10:41:54AM +0200, Daniel Vetter wrote:
>>>> On Fri, Sep 13, 2013 at 10:29 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>>>>> On Fri, Sep 13, 2013 at 09:46:03AM +0200, Thomas Hellstrom wrote:
>>>>>>>> if (!bo_tryreserve()) {
>>>>>>>>      up_read mmap_sem(); // Release the mmap_sem to avoid deadlocks.
>>>>>>>>      bo_reserve();               // Wait for the BO to become available (interruptible)
>>>>>>>>      bo_unreserve();           // Where is bo_wait_unreserved() when we need it, Maarten :P
>>>>>>>>      return VM_FAULT_RETRY; // Go ahead and retry the VMA walk, after regrabbing
>>>>>>>> }
>>>>>> Anyway, could you describe what is wrong, with the above solution, because
>>>>>> it seems perfectly legal to me.
>>>>> Luckily the rule of law doesn't have anything to do with this stuff --
>>>>> at least I sincerely hope so.
>>>>>
>>>>> The thing that's wrong with that pattern is that its still not
>>>>> deterministic - although its a lot better than the pure trylock. Because
>>>>> you have to release and re-acquire with the trylock another user might
>>>>> have gotten in again. Its utterly prone to starvation.
>>>>>
>>>>> The acquire+release does remove the dead/life-lock scenario from the
>>>>> FIFO case, since blocking on the acquire will allow the other task to
>>>>> run (or even get boosted on -rt).
>>>>>
>>>>> Aside from that there's nothing particularly wrong with it and lockdep
>>>>> should be happy afaict (but I haven't had my morning juice yet).
>>>> bo_reserve internally maps to a ww-mutex and task can already hold
>>>> ww-mutex (potentially even the same for especially nasty userspace).
>>> OK, yes I wasn't aware of that. Yes in that case you're quite right.
>>>
>> I added a RFC patch below.  I only tested with PROVE_LOCKING, and always forced the slowpath for debugging.
>>
>> This fixes nouveau and core ttm to always use blocking acquisition in fastpath.
>> Nouveau was a bit of a headache, but afaict it should work.
>>
>> In almost all cases relocs are not updated, so I kept intact the fastpath
>> of not copying relocs from userspace. The slowpath tries to copy it atomically,
>> and if that fails it will unreserve all bo's and copy everything.
>>
>> One thing to note is that the command submission ioctl may fail now with -EFAULT
>> if presumed cannot be updated, while the commands are submitted successfully.
>
> I think the Nouveau guys need to comment further on this, but returning -EFAULT might break existing user-space, and that's not allowed, but IIRC the return value of "presumed" is only a hint, and if it's incorrect will only trigger future command stream patching.
>
> Otherwise reviewing mostly the TTM stuff. FWIW, from what I can tell the vmwgfx driver doesn't need any fixups.
Well, because we read the list of buffer objects, the presumed offsets are at least read-mapped. Although I guess in the worst case the mapping might disappear before the syscall copies back the data.
So if -EFAULT happens here then userspace messed up in some way, either by forgetting to map the offsets read-write (which cannot happen with libdrm) or by freeing the bo list before the syscall returns,
which would probably result in libdrm crashing shortly afterwards anyway.

So I don't know whether to swallow that -EFAULT or not, which is what I asked.

And for vmwgfx just run it through PROVE_LOCKING and make sure all the copy_to/from_user call sites are called at least once, and then you can be certain it doesn't need fixups. ;)
Lockdep ftw..
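
(For completeness: the config knob in question is CONFIG_PROVE_LOCKING; a
minimal debug fragment would be roughly

CONFIG_DEBUG_KERNEL=y
CONFIG_PROVE_LOCKING=y

since PROVE_LOCKING selects LOCKDEP, DEBUG_LOCK_ALLOC and friends.)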

>>
>> I'm not sure what the right behavior was here, and this can only happen if you
>> touch the memory during the ioctl or use a read-only page. Both of them are not done
>> in the common case.
>>
>> Reviews welcome. :P
>>
>> 8<---
>>
>> diff --git a/drivers/gpu/drm/nouveau/nouveau_gem.c b/drivers/gpu/drm/nouveau/nouveau_gem.c
>> index e4d60e7..2964bb7 100644
>> --- a/drivers/gpu/drm/nouveau/nouveau_gem.c
>> +++ b/drivers/gpu/drm/nouveau/nouveau_gem.c
>> @@ -445,8 +445,6 @@ validate_list(struct nouveau_channel *chan, struct nouveau_cli *cli,
>>             uint64_t user_pbbo_ptr)
>>   {
>>       struct nouveau_drm *drm = chan->drm;
>> -    struct drm_nouveau_gem_pushbuf_bo __user *upbbo =
>> -                (void __force __user *)(uintptr_t)user_pbbo_ptr;
>>       struct nouveau_bo *nvbo;
>>       int ret, relocs = 0;
>>   @@ -475,7 +473,7 @@ validate_list(struct nouveau_channel *chan, struct nouveau_cli *cli,
>>               return ret;
>>           }
>>   -        if (nv_device(drm->device)->card_type < NV_50) {
>> +        if (nv_device(drm->device)->card_type < NV_50 && !relocs) {
>>               if (nvbo->bo.offset == b->presumed.offset &&
>>                   ((nvbo->bo.mem.mem_type == TTM_PL_VRAM &&
>>                     b->presumed.domain & NOUVEAU_GEM_DOMAIN_VRAM) ||
>> @@ -483,53 +481,86 @@ validate_list(struct nouveau_channel *chan, struct nouveau_cli *cli,
>>                     b->presumed.domain & NOUVEAU_GEM_DOMAIN_GART)))
>>                   continue;
>>   -            if (nvbo->bo.mem.mem_type == TTM_PL_TT)
>> -                b->presumed.domain = NOUVEAU_GEM_DOMAIN_GART;
>> -            else
>> -                b->presumed.domain = NOUVEAU_GEM_DOMAIN_VRAM;
>> -            b->presumed.offset = nvbo->bo.offset;
>> -            b->presumed.valid = 0;
>> -            relocs++;
>> -
>> -            if (DRM_COPY_TO_USER(&upbbo[nvbo->pbbo_index].presumed,
>> -                         &b->presumed, sizeof(b->presumed)))
>> -                return -EFAULT;
>> +            relocs = 1;
>>           }
>>       }
>>         return relocs;
>>   }
>>   +static inline void
>> +u_free(void *addr)
>> +{
>> +    if (!is_vmalloc_addr(addr))
>> +        kfree(addr);
>> +    else
>> +        vfree(addr);
>> +}
>
> Isn't there a DRM utility for this?
Oops, seems you're right..

Those functions can be replaced with drm_malloc_ab and drm_free_large.
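
Something like this untested sketch, on top of the drm_malloc_ab()/
drm_free_large() helpers from drmP.h (drm_malloc_ab also rejects
nmemb * size overflow, which the open-coded version doesn't):

static inline void *
u_memcpya(uint64_t user, unsigned nmemb, unsigned size, unsigned inatomic)
{
	void __user *userptr = (void __force __user *)(uintptr_t)user;
	size_t bytes = (size_t)size * nmemb; /* overflow already rejected */
	void *mem;

	mem = drm_malloc_ab(nmemb, size); /* kmalloc with vmalloc fallback */
	if (!mem)
		return ERR_PTR(-ENOMEM);

	if (inatomic && (!access_ok(VERIFY_READ, userptr, bytes) ||
	    __copy_from_user_inatomic(mem, userptr, bytes))) {
		drm_free_large(mem);
		return ERR_PTR(-EFAULT);
	} else if (!inatomic && copy_from_user(mem, userptr, bytes)) {
		drm_free_large(mem);
		return ERR_PTR(-EFAULT);
	}

	return mem;
}

The u_free() call sites would then simply become drm_free_large().
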
>> +
>> +static inline void *
>> +u_memcpya(uint64_t user, unsigned nmemb, unsigned size, unsigned inatomic)
>> +{
>> +    void *mem;
>> +    void __user *userptr = (void __force __user *)(uintptr_t)user;
>> +
>> +    size *= nmemb;
>> +
>> +    mem = kmalloc(size, GFP_KERNEL | __GFP_NOWARN);
>> +    if (!mem)
>> +        mem = vmalloc(size);
> And for the above as well?
Indeed..
>
>> ...
>>   out_prevalid:
>>       u_free(bo);
>>       u_free(push);
>> diff --git a/drivers/gpu/drm/ttm/ttm_bo_vm.c b/drivers/gpu/drm/ttm/ttm_bo_vm.c
>> index 1006c15..829e911 100644
>> --- a/drivers/gpu/drm/ttm/ttm_bo_vm.c
>> +++ b/drivers/gpu/drm/ttm/ttm_bo_vm.c
>> @@ -64,12 +64,9 @@ static int ttm_bo_vm_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
>>        * for reserve, and if it fails, retry the fault after scheduling.
>>        */
>>   -    ret = ttm_bo_reserve(bo, true, true, false, 0);
>> -    if (unlikely(ret != 0)) {
>> -        if (ret == -EBUSY)
>> -            set_need_resched();
>> +    ret = ttm_bo_reserve(bo, true, false, false, 0);
>> +    if (unlikely(ret != 0))
>>           return VM_FAULT_NOPAGE;
>> -    }
>>   
>
> Actually, it's not the locking order bo:reserve -> mmap_sem that triggers the lockdep warning, right? but the fact that copy_from_user could recurse into the fault handler? Grabbing bo:reserves recursively? which should leave us free to choose locking order at this point.
Same thing.

When PROVE_LOCKING is enabled, the might_fault calls in copy_to/from_user do a fake locking of mmap_sem, which means all locking errors will be caught, provided that the reservation lock is taken at least once with mmap_sem held (e.g. the ttm fault handler is called at least once; it can be done separately, it doesn't need to be done in the same syscall). So any bugs will be found. The only thing to worry about is that lockdep turns itself off after finding one error, so you have to make sure it doesn't bomb out before completing the tests, which is sometimes a problem on early -rc kernels. ;)
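
For reference, the annotation in question lives in mm/memory.c and looks
roughly like this (paraphrased; the exact details vary by kernel version):

#ifdef CONFIG_PROVE_LOCKING
void might_fault(void)
{
	/* Accesses via KERNEL_DS can never really fault. */
	if (segment_eq(get_fs(), KERNEL_DS))
		return;

	might_sleep();
	/*
	 * The "fake locking": record a read acquisition of mmap_sem so
	 * lockdep learns the held-locks -> mmap_sem ordering without the
	 * semaphore actually being taken. If a bo reservation is held
	 * here, lockdep can flag the inversion against the fault path,
	 * which takes mmap_sem -> bo:reserve.
	 */
	if (!in_atomic() && current->mm)
		might_lock_read(&current->mm->mmap_sem);
}
#endif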

~Maarten
Ingo Molnar Sept. 24, 2013, 8:20 a.m. UTC | #3
* Maarten Lankhorst <maarten.lankhorst@canonical.com> wrote:

> > I think the Nouveau guys need to comment further on this, but 
> > returning -EFAULT might break existing user-space, and that's not 
> > allowed, but IIRC the return value of "presumed" is only a hint, and 
> > if it's incorrect will only trigger future command stream patching.
> >
> > Otherwise reviewing mostly the TTM stuff. FWIW, from wat I can tell 
> > the vmwgfx driver doesn't need any fixups.
>
> Well because we read the list of buffer objects the presumed offsets are 
> at least read-mapped. Although I guess in the worst case the mapping 
> might disappear before the syscall copies back the data. So if -EFAULT 
> happens here then userspace messed up in some way, either by forgetting 
> to map the offsets read-write, which cannot happen with libdrm or 
> free'ing the bo list before the syscall returns, which would probably 
> result in libdrm crashing shortly afterwards anyway.
> 
> So I don't know whether to swallow that -EFAULT or not, which is what I 
> asked.

In such a case returning -EFAULT is very strongly preferred.

If there's a user-space bug, such as a context lifetime race between 
graphics context creation/destruction and user threads making use of the 
graphics context, then getting the -EFAULT would be very helpful to 
user-space debugging as well.

Thanks,

	Ingo
Thomas Hellstrom Sept. 24, 2013, 9:03 a.m. UTC | #4
On 09/24/2013 09:34 AM, Maarten Lankhorst wrote:
> On 24-09-13 09:22, Thomas Hellstrom wrote:
>> On 09/23/2013 05:33 PM, Maarten Lankhorst wrote:
>>> Hey,
>>>
>>> On 13-09-13 11:00, Peter Zijlstra wrote:
>>>> On Fri, Sep 13, 2013 at 10:41:54AM +0200, Daniel Vetter wrote:
>>>>> On Fri, Sep 13, 2013 at 10:29 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>>>>>> On Fri, Sep 13, 2013 at 09:46:03AM +0200, Thomas Hellstrom wrote:
>>>>>>>>> if (!bo_tryreserve()) {
>>>>>>>>>       up_read mmap_sem(); // Release the mmap_sem to avoid deadlocks.
>>>>>>>>>       bo_reserve();               // Wait for the BO to become available (interruptible)
>>>>>>>>>       bo_unreserve();           // Where is bo_wait_unreserved() when we need it, Maarten :P
>>>>>>>>>       return VM_FAULT_RETRY; // Go ahead and retry the VMA walk, after regrabbing
>>>>>>>>> }
>>>>>>> Anyway, could you describe what is wrong, with the above solution, because
>>>>>>> it seems perfectly legal to me.
>>>>>> Luckily the rule of law doesn't have anything to do with this stuff --
>>>>>> at least I sincerely hope so.
>>>>>>
>>>>>> The thing that's wrong with that pattern is that its still not
>>>>>> deterministic - although its a lot better than the pure trylock. Because
>>>>>> you have to release and re-acquire with the trylock another user might
>>>>>> have gotten in again. Its utterly prone to starvation.
>>>>>>
>>>>>> The acquire+release does remove the dead/life-lock scenario from the
>>>>>> FIFO case, since blocking on the acquire will allow the other task to
>>>>>> run (or even get boosted on -rt).
>>>>>>
>>>>>> Aside from that there's nothing particularly wrong with it and lockdep
>>>>>> should be happy afaict (but I haven't had my morning juice yet).
>>>>> bo_reserve internally maps to a ww-mutex and task can already hold
>>>>> ww-mutex (potentially even the same for especially nasty userspace).
>>>> OK, yes I wasn't aware of that. Yes in that case you're quite right.
>>>>
>>> I added a RFC patch below.  I only tested with PROVE_LOCKING, and always forced the slowpath for debugging.
>>>
>>> This fixes nouveau and core ttm to always use blocking acquisition in fastpath.
>>> Nouveau was a bit of a headache, but afaict it should work.
>>>
>>> In almost all cases relocs are not updated, so I kept intact the fastpath
>>> of not copying relocs from userspace. The slowpath tries to copy it atomically,
>>> and if that fails it will unreserve all bo's and copy everything.
>>>
>>> One thing to note is that the command submission ioctl may fail now with -EFAULT
>>> if presumed cannot be updated, while the commands are submitted successfully.
>> I think the Nouveau guys need to comment further on this, but returning -EFAULT might break existing user-space, and that's not allowed, but IIRC the return value of "presumed" is only a hint, and if it's incorrect will only trigger future command stream patching.
>>
>> Otherwise reviewing mostly the TTM stuff. FWIW, from what I can tell the vmwgfx driver doesn't need any fixups.
> Well because we read the list of buffer objects the presumed offsets are at least read-mapped. Although I guess in the worst case the mapping might disappear before the syscall copies back the data.
> So if -EFAULT happens here then userspace messed up in some way, either by forgetting to map the offsets read-write, which cannot happen with libdrm or free'ing the bo list before the syscall returns,
> which would probably result in libdrm crashing shortly afterwards anyway.

Hmm, is the list of buffer objects (and the "presumed" members) really 
in DRM memory? Because if it resides or may reside in anonymous system 
memory, it may well be paged out between reading and writing, in which 
case the -EFAULT return is incorrect.

In fact failures of pushbuf / execbuf *after* commands are successfully 
submitted are generally very hard to recover from. That's why the kernel 
should do whatever it takes to recover from such failures, and 
user-space should do whatever it takes to recover from copy-to-user 
failures of needed info from the kernel, and it really depends on the 
user-space usage pattern of "presumed". IIRC the original reason for 
copying it back to user-space was that if relocation offsets were 
patched up by the kernel, and then the process was sent a signal causing 
it to retry execbuf, then "presumed" had to be updated, otherwise it 
would be inconsistent with what's currently in the command stream, which 
is very bad. If "presumed" is, however, only used by user-space to guess 
an offset, the correct action would be to swallow the -EFAULT.

>
> So I don't know whether to swallow that -EFAULT or not, which is what I asked.
>
> And for vmwgfx just run it through PROVE_LOCKING and make sure all the copy_to/from_user call sites are called at least once, and then you can be certain it doesn't need fixups. ;)
> Lockdep ftw..
Will do that.

>
>> Actually, it's not the locking order bo:reserve -> mmap_sem that triggers the lockdep warning, right? but the fact that copy_from_user could recurse into the fault handler? Grabbing bo:reseves recursively? which should leave us free to choose locking order at this point.
> Same thing.
>
> When PROVE_LOCKING is enabled the might_fault calls in copy_to/from_user do a fake locking of mmap_sem, which means all locking errors will be caught, provided that the reservation lock is taken at least once with mmap_sem held (eg the ttm fault handler is called at least once, it can be done separately, it doesn't need to be done in the same syscall). So any bugs will be found. The only thing to worry about is that lockdep turns off after finding 1 error, so you have to make sure it doesn't bomb out before completing tests, which is sometimes a problem on early rc kernels. ;)
>
> ~Maarten

My point was that when we only have copy_[to|from]_user_inatomic inside 
any bo:reservations, might_fault would never be called inside any 
reservations and we should, in principle, be free to choose locking 
order, provided of course it could be done cleanly in fault()?

/Thomas
Daniel Vetter Sept. 24, 2013, 9:36 a.m. UTC | #5
On Tue, Sep 24, 2013 at 11:03:37AM +0200, Thomas Hellstrom wrote:
> On 09/24/2013 09:34 AM, Maarten Lankhorst wrote:
> >On 24-09-13 09:22, Thomas Hellstrom wrote:
> >>On 09/23/2013 05:33 PM, Maarten Lankhorst wrote:
> >>>Hey,
> >>>
> >>>On 13-09-13 11:00, Peter Zijlstra wrote:
> >>>>On Fri, Sep 13, 2013 at 10:41:54AM +0200, Daniel Vetter wrote:
> >>>>>On Fri, Sep 13, 2013 at 10:29 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> >>>>>>On Fri, Sep 13, 2013 at 09:46:03AM +0200, Thomas Hellstrom wrote:
> >>>>>>>>>if (!bo_tryreserve()) {
> >>>>>>>>>      up_read mmap_sem(); // Release the mmap_sem to avoid deadlocks.
> >>>>>>>>>      bo_reserve();               // Wait for the BO to become available (interruptible)
> >>>>>>>>>      bo_unreserve();           // Where is bo_wait_unreserved() when we need it, Maarten :P
> >>>>>>>>>      return VM_FAULT_RETRY; // Go ahead and retry the VMA walk, after regrabbing
> >>>>>>>>>}
> >>>>>>>Anyway, could you describe what is wrong, with the above solution, because
> >>>>>>>it seems perfectly legal to me.
> >>>>>>Luckily the rule of law doesn't have anything to do with this stuff --
> >>>>>>at least I sincerely hope so.
> >>>>>>
> >>>>>>The thing that's wrong with that pattern is that its still not
> >>>>>>deterministic - although its a lot better than the pure trylock. Because
> >>>>>>you have to release and re-acquire with the trylock another user might
> >>>>>>have gotten in again. Its utterly prone to starvation.
> >>>>>>
> >>>>>>The acquire+release does remove the dead/life-lock scenario from the
> >>>>>>FIFO case, since blocking on the acquire will allow the other task to
> >>>>>>run (or even get boosted on -rt).
> >>>>>>
> >>>>>>Aside from that there's nothing particularly wrong with it and lockdep
> >>>>>>should be happy afaict (but I haven't had my morning juice yet).
> >>>>>bo_reserve internally maps to a ww-mutex and task can already hold
> >>>>>ww-mutex (potentially even the same for especially nasty userspace).
> >>>>OK, yes I wasn't aware of that. Yes in that case you're quite right.
> >>>>
> >>>I added a RFC patch below.  I only tested with PROVE_LOCKING, and always forced the slowpath for debugging.
> >>>
> >>>This fixes nouveau and core ttm to always use blocking acquisition in fastpath.
> >>>Nouveau was a bit of a headache, but afaict it should work.
> >>>
> >>>In almost all cases relocs are not updated, so I kept intact the fastpath
> >>>of not copying relocs from userspace. The slowpath tries to copy it atomically,
> >>>and if that fails it will unreserve all bo's and copy everything.
> >>>
> >>>One thing to note is that the command submission ioctl may fail now with -EFAULT
> >>>if presumed cannot be updated, while the commands are submitted successfully.
> >>I think the Nouveau guys need to comment further on this, but returning -EFAULT might break existing user-space, and that's not allowed, but IIRC the return value of "presumed" is only a hint, and if it's incorrect will only trigger future command stream patching.
> >>
> >>Otherwise reviewing mostly the TTM stuff. FWIW, from what I can tell the vmwgfx driver doesn't need any fixups.
> >Well because we read the list of buffer objects the presumed offsets are at least read-mapped. Although I guess in the worst case the mapping might disappear before the syscall copies back the data.
> >So if -EFAULT happens here then userspace messed up in some way, either by forgetting to map the offsets read-write, which cannot happen with libdrm or free'ing the bo list before the syscall returns,
> >which would probably result in libdrm crashing shortly afterwards anyway.
> 
> Hmm, is the list of buffer objects (and the "presumed" members)
> really in DRM memory? Because if it resides or may reside in
> anonymous system memory, it may well be paged out between reading
> and writing, in which case the -EFAULT return is incorrect.
> 
> In fact failures of pushbuf / execbuf *after* commands are
> successfully submitted are generally very hard to recover from.
> That's why the kernel should do whatever it takes to recover from
> such failures, and user-space should do whatever it takes to recover
> from copy-to-user failures of needed info from the kernel, and it
> really depends on the user-space usage pattern of "presumed". IIRC
> the original reason for copying it back to user-space was, that if a
> relocation offsets were patched up by the kernel, and then the
> process was sent a signal causing it to retry execbuf, then
> "presumed" had to be updated, otherwise it would be inconsistent
> with what's currently in the command stream, which is very bad. If
> "presumed" is, however, only used by user-space to guess an offset,
> the correct action would be to swallow the -EFAULT.

In i915 we've had tons of fun with a regression in 3.7 where exactly this
blows up: some of our userspace (the UXA ddx specifically) retains
relocation trees partially across an execbuf. Which means that if the kernel
updates the relocations it _must_ update the presumed offset, since
otherwise things will go haywire on the next execbuf. So we can't return
-EFAULT if the userspace memory just needs to be refaulted, but we still need
to guarantee a "correct" presumed offset.

Since we didn't come up with a way to make sure this will work in all
cases when we get an -EFAULT while writing back presumed offsets, we have a
rather tricky piece of fallback logic.
- Any -EFAULT error in the fastpath will drop us into the relocation
  slowpath. The slowpath completely processes relocs anew, so any updates
  done by the fastpath are irrelevant.

- The first thing the slowpath does is set the presumed offset in the
  userspace reloc lists to an invalid address (we pick -1) to make sure
  that any subsequent execbuf with the same partial reloc tree will again
  go through the relocation update code.

- Then we do the usual slowpath, i.e. copy relocs from userspace, re-grab
  locks and then process them. The copy-back of the presumed offset
  happens with a copy_to_user_inatomic, and we ignore any errors.
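
In outline (a schematic sketch of the control flow described above, with
made-up helper names rather than the actual i915 functions):

	ret = reloc_fastpath();		/* locks held, *_inatomic only */
	if (ret == -EFAULT) {
		unlock_all_bos();

		/* Poison the userspace presumed offsets first, so a later
		 * execbuf reusing a partial reloc tree is forced back into
		 * the relocation update code. */
		for_each_reloc(r)
			put_user((uint64_t)-1, &r->presumed_offset);

		copy_relocs_from_user();	/* blocking, no locks held */

		lock_all_bos();
		ret = reloc_slowpath();		/* presumed write-back is
						 * inatomic, errors ignored */
	}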

Of course we try really hard to make sure that we never hit the reloc
slowpath ;-) But nowadays this is all fully tested with some nasty
testcases (and a small kernel option to disable prefaulting).

Cheers, Daniel
Maarten Lankhorst Sept. 24, 2013, 9:43 a.m. UTC | #6
On 24-09-13 11:03, Thomas Hellstrom wrote:
> On 09/24/2013 09:34 AM, Maarten Lankhorst wrote:
>> On 24-09-13 09:22, Thomas Hellstrom wrote:
>>> On 09/23/2013 05:33 PM, Maarten Lankhorst wrote:
>>>> Hey,
>>>>
>>>> On 13-09-13 11:00, Peter Zijlstra wrote:
>>>>> On Fri, Sep 13, 2013 at 10:41:54AM +0200, Daniel Vetter wrote:
>>>>>> On Fri, Sep 13, 2013 at 10:29 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>>>>>>> On Fri, Sep 13, 2013 at 09:46:03AM +0200, Thomas Hellstrom wrote:
>>>>>>>>>> if (!bo_tryreserve()) {
>>>>>>>>>>       up_read mmap_sem(); // Release the mmap_sem to avoid deadlocks.
>>>>>>>>>>       bo_reserve();               // Wait for the BO to become available (interruptible)
>>>>>>>>>>       bo_unreserve();           // Where is bo_wait_unreserved() when we need it, Maarten :P
>>>>>>>>>>       return VM_FAULT_RETRY; // Go ahead and retry the VMA walk, after regrabbing
>>>>>>>>>> }
>>>>>>>> Anyway, could you describe what is wrong, with the above solution, because
>>>>>>>> it seems perfectly legal to me.
>>>>>>> Luckily the rule of law doesn't have anything to do with this stuff --
>>>>>>> at least I sincerely hope so.
>>>>>>>
>>>>>>> The thing that's wrong with that pattern is that its still not
>>>>>>> deterministic - although its a lot better than the pure trylock. Because
>>>>>>> you have to release and re-acquire with the trylock another user might
>>>>>>> have gotten in again. Its utterly prone to starvation.
>>>>>>>
>>>>>>> The acquire+release does remove the dead/life-lock scenario from the
>>>>>>> FIFO case, since blocking on the acquire will allow the other task to
>>>>>>> run (or even get boosted on -rt).
>>>>>>>
>>>>>>> Aside from that there's nothing particularly wrong with it and lockdep
>>>>>>> should be happy afaict (but I haven't had my morning juice yet).
>>>>>> bo_reserve internally maps to a ww-mutex and task can already hold
>>>>>> ww-mutex (potentially even the same for especially nasty userspace).
>>>>> OK, yes I wasn't aware of that. Yes in that case you're quite right.
>>>>>
>>>> I added a RFC patch below.  I only tested with PROVE_LOCKING, and always forced the slowpath for debugging.
>>>>
>>>> This fixes nouveau and core ttm to always use blocking acquisition in fastpath.
>>>> Nouveau was a bit of a headache, but afaict it should work.
>>>>
>>>> In almost all cases relocs are not updated, so I kept intact the fastpath
>>>> of not copying relocs from userspace. The slowpath tries to copy it atomically,
>>>> and if that fails it will unreserve all bo's and copy everything.
>>>>
>>>> One thing to note is that the command submission ioctl may fail now with -EFAULT
>>>> if presumed cannot be updated, while the commands are submitted successfully.
>>> I think the Nouveau guys need to comment further on this, but returning -EFAULT might break existing user-space, and that's not allowed, but IIRC the return value of "presumed" is only a hint, and if it's incorrect will only trigger future command stream patching.
>>>
>>> Otherwise reviewing mostly the TTM stuff. FWIW, from what I can tell the vmwgfx driver doesn't need any fixups.
>> Well because we read the list of buffer objects the presumed offsets are at least read-mapped. Although I guess in the worst case the mapping might disappear before the syscall copies back the data.
>> So if -EFAULT happens here then userspace messed up in some way, either by forgetting to map the offsets read-write, which cannot happen with libdrm or free'ing the bo list before the syscall returns,
>> which would probably result in libdrm crashing shortly afterwards anyway.
>
> Hmm, is the list of buffer objects (and the "presumed" members) really in DRM memory? Because if it resides or may reside in anonymous system memory, it may well be paged out between reading and writing, in which case the -EFAULT return is incorrect.
It may be swapped out, but I don't use read.*atomic copy calls; faults are allowed to happen at that point, and no -EFAULT will be returned from
that. The only -EFAULT that can happen is if the memory ends up being read-only, or has ceased to exist.

> In fact failures of pushbuf / execbuf *after* commands are successfully submitted are generally very hard to recover from. That's why the kernel should do whatever it takes to recover from such failures, and user-space should do whatever it takes to recover from copy-to-user failures of needed info from the kernel, and it really depends on the user-space usage pattern of "presumed". IIRC the original reason for copying it back to user-space was, that if a relocation offsets were patched up by the kernel, and then the process was sent a signal causing it to retry execbuf, then "presumed" had to be updated, otherwise it would be inconsistent with what's currently in the command stream, which is very bad. If "presumed" is, however, only used by user-space to guess an offset, the correct action would be to swallow the -EFAULT.
Yeah, it's a guess from userspace. But it's probably still a programming error by userspace if it uses a read-only
mapping for presumed, or a mapping that's invalidated. :P But there's nothing I can do at that point;
I can't undo something I just committed. This is why I asked whether I should swallow it or not, because it's
unclear to me. :) There's not much of an impact on userspace if I swallow it, but userspace did deliberately
mess up the call.

>>
>> So I don't know whether to swallow that -EFAULT or not, which is what I asked.
>>
>> And for vmwgfx just run it through PROVE_LOCKING and make sure all the copy_to/from_user call sites are called at least once, and then you can be certain it doesn't need fixups. ;)
>> Lockdep ftw..
> Will do that.
>
>>
>>> Actually, it's not the locking order bo:reserve -> mmap_sem that triggers the lockdep warning, right? but the fact that copy_from_user could recurse into the fault handler? Grabbing bo:reserves recursively? which should leave us free to choose locking order at this point.
>> Same thing.
>>
>> When PROVE_LOCKING is enabled the might_fault calls in copy_to/from_user do a fake locking of mmap_sem, which means all locking errors will be caught, provided that the reservation lock is taken at least once with mmap_sem held (eg the ttm fault handler is called at least once, it can be done separately, it doesn't need to be done in the same syscall). So any bugs will be found. The only thing to worry about is that lockdep turns off after finding 1 error, so you have to make sure it doesn't bomb out before completing tests, which is sometimes a problem on early rc kernels. ;)
>
> My point was that when we only have copy_[to|from]_user_inatomic inside any bo:reservations, the might_fault would never be called inside any reservations and we should, in principle, be free to choose locking order, provided of course it could be done cleanly in fault()?
I think it makes sense to keep mmap_sem the outer lock here, and not do scary things in fault. Especially if the mm locking is going to be changed in the future. But you're correct: if holding reservations, only the inatomic copies should be used.

~Maarten
Maarten Lankhorst Sept. 24, 2013, 10:11 a.m. UTC | #7
On 24-09-13 11:36, Daniel Vetter wrote:
> On Tue, Sep 24, 2013 at 11:03:37AM +0200, Thomas Hellstrom wrote:
>> On 09/24/2013 09:34 AM, Maarten Lankhorst wrote:
>>> On 24-09-13 09:22, Thomas Hellstrom wrote:
>>>> On 09/23/2013 05:33 PM, Maarten Lankhorst wrote:
>>>>> Hey,
>>>>>
>>>>> On 13-09-13 11:00, Peter Zijlstra wrote:
>>>>>> On Fri, Sep 13, 2013 at 10:41:54AM +0200, Daniel Vetter wrote:
>>>>>>> On Fri, Sep 13, 2013 at 10:29 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>>>>>>>> On Fri, Sep 13, 2013 at 09:46:03AM +0200, Thomas Hellstrom wrote:
>>>>>>>>>>> if (!bo_tryreserve()) {
>>>>>>>>>>>      up_read mmap_sem(); // Release the mmap_sem to avoid deadlocks.
>>>>>>>>>>>      bo_reserve();               // Wait for the BO to become available (interruptible)
>>>>>>>>>>>      bo_unreserve();           // Where is bo_wait_unreserved() when we need it, Maarten :P
>>>>>>>>>>>      return VM_FAULT_RETRY; // Go ahead and retry the VMA walk, after regrabbing
>>>>>>>>>>> }
>>>>>>>>> Anyway, could you describe what is wrong, with the above solution, because
>>>>>>>>> it seems perfectly legal to me.
>>>>>>>> Luckily the rule of law doesn't have anything to do with this stuff --
>>>>>>>> at least I sincerely hope so.
>>>>>>>>
>>>>>>>> The thing that's wrong with that pattern is that its still not
>>>>>>>> deterministic - although its a lot better than the pure trylock. Because
>>>>>>>> you have to release and re-acquire with the trylock another user might
>>>>>>>> have gotten in again. Its utterly prone to starvation.
>>>>>>>>
>>>>>>>> The acquire+release does remove the dead/life-lock scenario from the
>>>>>>>> FIFO case, since blocking on the acquire will allow the other task to
>>>>>>>> run (or even get boosted on -rt).
>>>>>>>>
>>>>>>>> Aside from that there's nothing particularly wrong with it and lockdep
>>>>>>>> should be happy afaict (but I haven't had my morning juice yet).
>>>>>>> bo_reserve internally maps to a ww-mutex and task can already hold
>>>>>>> ww-mutex (potentially even the same for especially nasty userspace).
>>>>>> OK, yes I wasn't aware of that. Yes in that case you're quite right.
>>>>>>
>>>>> I added a RFC patch below.  I only tested with PROVE_LOCKING, and always forced the slowpath for debugging.
>>>>>
>>>>> This fixes nouveau and core ttm to always use blocking acquisition in fastpath.
>>>>> Nouveau was a bit of a headache, but afaict it should work.
>>>>>
>>>>> In almost all cases relocs are not updated, so I kept intact the fastpath
>>>>> of not copying relocs from userspace. The slowpath tries to copy it atomically,
>>>>> and if that fails it will unreserve all bo's and copy everything.
>>>>>
>>>>> One thing to note is that the command submission ioctl may fail now with -EFAULT
>>>>> if presumed cannot be updated, while the commands are submitted successfully.
>>>> I think the Nouveau guys need to comment further on this, but returning -EFAULT might break existing user-space, and that's not allowed, but IIRC the return value of "presumed" is only a hint, and if it's incorrect will only trigger future command stream patching.
>>>>
>>>> Otherwise reviewing mostly the TTM stuff. FWIW, from what I can tell the vmwgfx driver doesn't need any fixups.
>>> Well because we read the list of buffer objects the presumed offsets are at least read-mapped. Although I guess in the worst case the mapping might disappear before the syscall copies back the data.
>>> So if -EFAULT happens here then userspace messed up in some way, either by forgetting to map the offsets read-write, which cannot happen with libdrm or free'ing the bo list before the syscall returns,
>>> which would probably result in libdrm crashing shortly afterwards anyway.
>> Hmm, is the list of buffer objects (and the "presumed" members)
>> really in DRM memory? Because if it resides or may reside in
>> anonymous system memory, it may well be paged out between reading
>> and writing, in which case the -EFAULT return is incorrect.
>>
>> In fact failures of pushbuf / execbuf *after* commands are
>> successfully submitted are generally very hard to recover from.
>> That's why the kernel should do whatever it takes to recover from
>> such failures, and user-space should do whatever it takes to recover
>> from copy-to-user failures of needed info from the kernel, and it
>> really depends on the user-space usage pattern of "presumed". IIRC
>> the original reason for copying it back to user-space was, that if a
>> relocation offsets were patched up by the kernel, and then the
>> process was sent a signal causing it to retry execbuf, then
>> "presumed" had to be updated, otherwise it would be inconsistent
>> with what's currently in the command stream, which is very bad. If
>> "presumed" is, however, only used by user-space to guess an offset,
>> the correct action would be to swallow the -EFAULT.
> In i915 we've had tons of fun with a regression in 3.7 where exactly this
> blows up: Some of our userspace (UXA ddx specifically) retains
> relocations-trees partially across an execbuf. Which means if the kernel
> updates the relocations it _must_ update the presumed offset for
> otherwise things will go haywire on the next execbuf. So we can't return
> -EFAULT if the userspace memory needs to be just refaulted but still need
> to guarantee a "correct" presumed offset.
>
> Since we didn't come up with a way to make sure this will work in all
> cases when we get an -EFAULT when writing back presumed offsets we have a
> rather tricky piece of fallback logic.
> - Any -EFAULT error in the fastpath will drop us into the relocation
>   slowpath. The slowpath completely processes relocs anew, so any updates
>   done by the fastpath are irrelevant.
>
> - The first thing the slowpath does is set the presumed offset in the
>   userspace reloc lists to an invalid address (we pick -1) to make sure
>   that any subsequent execbuf with the same partial reloc tree will again
>   go through the relocation update code.
>
> - Then we do the usual slowpath, i.e. copy relocs from userspace, re-grab
>   locks and then process them. The copy-back of the presumed offset
>   happens with a copy_to_user_inatomic, and we ignore any errors.
>
> Of course we try really hard to make sure that we never hit the reloc
> slowpath ;-) But nowadays this is all fully tested with some nasty
> testcases (and a small kernel option to disable prefaulting).
>
It seems userspace only updates offset and domain in nouveau. If it fails to update,
it would result in the same effect as when the buffer gets moved around by TTM.
But hey, maybe I'll have some fun: I'll lie to userspace, hardcode the userspace offset
to 0x40000000, always force the domain to be different and see what breaks.

My guess is absolutely nothing, except it might expose some bugs where we forgot annotations.

~Maarten
Thomas Hellstrom Sept. 24, 2013, 10:33 a.m. UTC | #8
On 09/24/2013 12:11 PM, Maarten Lankhorst wrote:
> On 24-09-13 11:36, Daniel Vetter wrote:
>> On Tue, Sep 24, 2013 at 11:03:37AM +0200, Thomas Hellstrom wrote:
>>> On 09/24/2013 09:34 AM, Maarten Lankhorst wrote:
>>>> On 24-09-13 09:22, Thomas Hellstrom wrote:
>>>>> On 09/23/2013 05:33 PM, Maarten Lankhorst wrote:
>>>>>> Hey,
>>>>>>
>>>>>> On 13-09-13 11:00, Peter Zijlstra wrote:
>>>>>>> On Fri, Sep 13, 2013 at 10:41:54AM +0200, Daniel Vetter wrote:
>>>>>>>> On Fri, Sep 13, 2013 at 10:29 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>>>>>>>>> On Fri, Sep 13, 2013 at 09:46:03AM +0200, Thomas Hellstrom wrote:
>>>>>>>>>>>> if (!bo_tryreserve()) {
>>>>>>>>>>>>       up_read mmap_sem(); // Release the mmap_sem to avoid deadlocks.
>>>>>>>>>>>>       bo_reserve();               // Wait for the BO to become available (interruptible)
>>>>>>>>>>>>       bo_unreserve();           // Where is bo_wait_unreserved() when we need it, Maarten :P
>>>>>>>>>>>>       return VM_FAULT_RETRY; // Go ahead and retry the VMA walk, after regrabbing
>>>>>>>>>>>> }
>>>>>>>>>> Anyway, could you describe what is wrong, with the above solution, because
>>>>>>>>>> it seems perfectly legal to me.
>>>>>>>>> Luckily the rule of law doesn't have anything to do with this stuff --
>>>>>>>>> at least I sincerely hope so.
>>>>>>>>>
>>>>>>>>> The thing that's wrong with that pattern is that its still not
>>>>>>>>> deterministic - although its a lot better than the pure trylock. Because
>>>>>>>>> you have to release and re-acquire with the trylock another user might
>>>>>>>>> have gotten in again. Its utterly prone to starvation.
>>>>>>>>>
>>>>>>>>> The acquire+release does remove the dead/life-lock scenario from the
>>>>>>>>> FIFO case, since blocking on the acquire will allow the other task to
>>>>>>>>> run (or even get boosted on -rt).
>>>>>>>>>
>>>>>>>>> Aside from that there's nothing particularly wrong with it and lockdep
>>>>>>>>> should be happy afaict (but I haven't had my morning juice yet).
>>>>>>>> bo_reserve internally maps to a ww-mutex and task can already hold
>>>>>>>> ww-mutex (potentially even the same for especially nasty userspace).
>>>>>>> OK, yes I wasn't aware of that. Yes in that case you're quite right.
>>>>>>>
>>>>>> I added a RFC patch below.  I only tested with PROVE_LOCKING, and always forced the slowpath for debugging.
>>>>>>
>>>>>> This fixes nouveau and core ttm to always use blocking acquisition in fastpath.
>>>>>> Nouveau was a bit of a headache, but afaict it should work.
>>>>>>
>>>>>> In almost all cases relocs are not updated, so I kept intact the fastpath
>>>>>> of not copying relocs from userspace. The slowpath tries to copy it atomically,
>>>>>> and if that fails it will unreserve all bo's and copy everything.
>>>>>>
>>>>>> One thing to note is that the command submission ioctl may fail now with -EFAULT
>>>>>> if presumed cannot be updated, while the commands are submitted successfully.
>>>>> I think the Nouveau guys need to comment further on this, but returning -EFAULT might break existing user-space, and that's not allowed, but IIRC the return value of "presumed" is only a hint, and if it's incorrect will only trigger future command stream patching.
>>>>>
>>>>> Otherwise reviewing mostly the TTM stuff. FWIW, from what I can tell the vmwgfx driver doesn't need any fixups.
>>>> Well, because we read the list of buffer objects, the presumed offsets are at least read-mapped. Although I guess in the worst case the mapping might disappear before the syscall copies back the data.
>>>> So if -EFAULT happens here then userspace messed up in some way: either it forgot to map the offsets read-write (which cannot happen with libdrm), or it freed the bo list before the syscall returned,
>>>> which would probably result in libdrm crashing shortly afterwards anyway.
>>> Hmm, is the list of buffer objects (and the "presumed" members)
>>> really in DRM memory? Because if it resides or may reside in
>>> anonymous system memory, it may well be paged out between reading
>>> and writing, in which case the -EFAULT return is incorrect.
>>>
>>> In fact failures of pushbuf / execbuf *after* commands are
>>> successfully submitted are generally very hard to recover from.
>>> That's why the kernel should do whatever it takes to recover from
>>> such failures, and user-space should do whatever it takes to recover
>>> from copy-to-user failures of needed info from the kernel, and it
>>> really depends on the user-space usage pattern of "presumed". IIRC
>>> the original reason for copying it back to user-space was that if
>>> relocation offsets were patched up by the kernel, and the process
>>> was then sent a signal causing it to retry execbuf, then
>>> "presumed" had to be updated, otherwise it would be inconsistent
>>> with what's currently in the command stream, which is very bad. If
>>> "presumed" is, however, only used by user-space to guess an offset,
>>> the correct action would be to swallow the -EFAULT.
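
To make the distinction concrete, here is a rough userspace-side sketch of the two usage patterns (invented names and types, not actual libdrm code). In the hint-only case a stale "presumed" merely costs another round of kernel patching; in the cached-reloc-tree case (the UXA-style usage described below) it silently corrupts the next submission.

#include <stdint.h>

struct my_reloc {
	uint64_t presumed_offset;	/* guess copied back by the kernel */
	uint32_t handle;		/* target buffer object */
};

/* Pattern 1: "presumed" is a pure hint. A stale value is harmless;
 * the kernel simply patches the command stream again next time, so
 * swallowing -EFAULT on the copy-back would be fine. */
static void emit_reloc_hint_only(uint32_t *cs, const struct my_reloc *r)
{
	*cs = (uint32_t)r->presumed_offset;	/* fixed up by the kernel if stale */
}

/* Pattern 2: the reloc tree is cached across execbufs. If the kernel
 * patched the stream but failed to write back "presumed", this test
 * gives the wrong answer and the next submission carries a stale
 * address that the kernel will not patch again. */
static int reloc_needs_repatch(const struct my_reloc *r, uint64_t cur_offset)
{
	return r->presumed_offset != cur_offset;
}
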
>> In i915 we've had tons of fun with a regression in 3.7 where exactly this
>> blows up: some of our userspace (the UXA ddx specifically) retains
>> relocation trees partially across an execbuf. Which means that if the kernel
>> updates the relocations it _must_ also update the presumed offset,
>> otherwise things will go haywire on the next execbuf. So we can't return
>> -EFAULT just because the userspace memory needs to be refaulted, but we still
>> need to guarantee a "correct" presumed offset.
>>
>> Since we didn't come up with a way to make sure this will work in all
>> cases where we get an -EFAULT while writing back presumed offsets, we have a
>> rather tricky piece of fallback logic:
>> - Any -EFAULT error in the fastpath will drop us into the relocation
>>    slowpath. The slowpath completely processes relocs anew, so any updates
>>    done by the fastpath are irrelevant.
>>
>> - The first thing the slowpath does is set the presumed offset in the
>>    userspace reloc lists to an invalid address (we pick -1) to make sure
>>    that any subsequent execbuf with the same partial reloc tree will again
>>    go through the relocation update code.
>>
>> - Then we do the usual slowpath, i.e. copy relocs from userspace, re-grab
>>    locks and then process them. The copy-back of the presumed offset
>>    happens with a copy_to_user_inatomic, and we ignore any errors.
>>
>> Of course we try really hard to make sure that we never hit the reloc
>> slowpath ;-) But nowadays this is all fully tested with some nasty
>> testcases (and a small kernel option to disable prefaulting).
>>
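
In outline, that fallback looks roughly like this (a sketch with invented helper names and types, not the actual i915 code):

struct execbuf_state;	/* stand-in for the driver's execbuf bookkeeping */

static int apply_relocs_fast(struct execbuf_state *eb);   /* inatomic copies only */
static int apply_relocs_slow(struct execbuf_state *eb);
static int invalidate_user_presumed(struct execbuf_state *eb); /* writes -1 */
static int copy_relocs_from_user(struct execbuf_state *eb);
static void unlock_all_bos(struct execbuf_state *eb);
static int relock_all_bos(struct execbuf_state *eb);

static int process_relocs(struct execbuf_state *eb)
{
	/* Fastpath: patch relocs in place; only non-faulting user copies
	 * are allowed here because the bo locks are held. */
	int ret = apply_relocs_fast(eb);
	if (ret != -EFAULT)
		return ret;

	/* Slowpath: drop every bo lock before anything that may fault. */
	unlock_all_bos(eb);

	/* Poison the user-visible presumed offsets first, so that a
	 * partially reused reloc tree is never trusted on a later
	 * execbuf, even if the final copy-back fails. */
	ret = invalidate_user_presumed(eb);
	if (ret)
		return ret;

	ret = copy_relocs_from_user(eb);	/* full copy, may block */
	if (ret)
		return ret;

	ret = relock_all_bos(eb);
	if (ret)
		return ret;

	/* The later copy-back of presumed offsets is best-effort
	 * (copy_to_user_inatomic) and errors are deliberately ignored. */
	return apply_relocs_slow(eb);
}
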
> It seems userspace only updates offset and domain in nouveau. If the kernel fails to update
> them, the result is the same as when the buffer gets moved around by TTM.
> But hey, maybe I'll have some fun: I'll lie to userspace, hardcode the userspace offset
> to 0x40000000, always force the domain to be different, and see what breaks.
>
> My guess is absolutely nothing, except it might expose some bugs where we forgot annotation...
I think that would certainly break if you return -ERESTARTSYS after
applying relocations but
before submitting the command stream to hardware....

/Thomas

Maarten Lankhorst Sept. 24, 2013, 11:32 a.m. UTC | #9
On 24-09-13 12:33, Thomas Hellstrom wrote:
> On 09/24/2013 12:11 PM, Maarten Lankhorst wrote:
>> It seems userspace only updates offset and domain in nouveau. If the kernel fails to update
>> them, the result is the same as when the buffer gets moved around by TTM.
>> But hey, maybe I'll have some fun: I'll lie to userspace, hardcode the userspace offset
>> to 0x40000000, always force the domain to be different, and see what breaks.
>>
>> My guess is absolutely nothing, except it might expose some bugs where we forgot annotation...
> I think that would certainly break if you return -ERESTARTSYS after applying relocations but
> before submitting the command stream to hardware....
The relocations are updated before submitting the command stream, but they're copied back to
userspace after submitting it. I'm not sure what -ERESTARTSYS would change; at that point the
syscall is not in an interruptible state.
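
In other words, the intended ordering in the ioctl is roughly this (a simplified sketch with invented function names, not the literal code):

static int validate_and_apply_relocs(void);	/* interruptible */
static void submit_to_hardware(void);
static int copy_presumed_back_to_user(void);

static int pushbuf_flow(void)
{
	/* May return -ERESTARTSYS; nothing has been submitted yet,
	 * so restarting the syscall is still safe here. */
	int ret = validate_and_apply_relocs();
	if (ret)
		return ret;

	/* Past this point the ioctl must not restart. */
	submit_to_hardware();

	/* Only now is "presumed" copied back; a fault here can at worst
	 * become -EFAULT for an already-submitted command stream. */
	return copy_presumed_back_to_user();
}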

Oh, and after some testing with relocations, it seems they always break nouveau. Originally I set domain
and offset differently, but that crashed the card. In my second try I kept domain and offset correct, and hacked
the kernel to force-apply all relocations. Unsurprisingly, nouveau still breaks, even though the relocations
themselves should have been a no-op. :/

Swallowing -EFAULT is theoretically fine, though; I wouldn't worry about it.

~Maarten
Thomas Hellstrom Sept. 24, 2013, 5:04 p.m. UTC | #10
On 09/24/2013 01:32 PM, Maarten Lankhorst wrote:
> On 24-09-13 12:33, Thomas Hellstrom wrote:
>> On 09/24/2013 12:11 PM, Maarten Lankhorst wrote:
>>>
>>> It seems userspace only updates offset and domain in nouveau. If the kernel fails to update
>>> them, the result is the same as when the buffer gets moved around by TTM.
>>> But hey, maybe I'll have some fun: I'll lie to userspace, hardcode the userspace offset
>>> to 0x40000000, always force the domain to be different, and see what breaks.
>>>
>>> My guess is absolutely nothing, except it might expose some bugs where we forgot annotation...
>> I think that would certainly break if you return -ERESTARTSYS after applying relocations but
>> before submitting the command stream to hardware....
> The relocations are updated before submitting the command stream, but they're copied back to
> userspace after submitting it. I'm not sure what -ERESTARTSYS would change; at that point the
> syscall is not in an interruptible state.
Hmm, I was assuming, without looking at the code, that there was an
interruptible wait for pipe/ring space after
relocation application.

/Thomas
Thomas Hellstrom Sept. 24, 2013, 5:53 p.m. UTC | #11
On 09/24/2013 11:43 AM, Maarten Lankhorst wrote:
> On 24-09-13 11:03, Thomas Hellstrom wrote:
>> On 09/24/2013 09:34 AM, Maarten Lankhorst wrote:
>>> On 24-09-13 09:22, Thomas Hellstrom wrote:
>>>> Actually, it's not the locking order bo:reserve -> mmap_sem that triggers the lockdep warning, right? Rather, it's the fact that copy_from_user could recurse into the fault handler, grabbing bo:reserves recursively? Which should leave us free to choose the locking order at this point.
>>> Same thing.
>>>
>>> When PROVE_LOCKING is enabled, the might_fault calls in copy_to/from_user do a fake locking of mmap_sem, so all locking errors will be found, provided that the reservation lock is taken at least once with mmap_sem held (e.g. the ttm fault handler is called at least once; this can be done separately, it doesn't need to happen in the same syscall). The only thing to worry about is that lockdep turns itself off after finding one error, so you have to make sure it doesn't bomb out before the tests complete, which is sometimes a problem on early rc kernels. ;)
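
For reference, the annotation boils down to roughly this (simplified; see might_fault() in mm/memory.c for the real implementation):

/* Sketch of what every copy_to/from_user call does under
 * CONFIG_PROVE_LOCKING, whether or not a page fault actually occurs. */
void might_fault(void)
{
	might_sleep();
	/*
	 * Fake read-acquisition of mmap_sem: lockdep records a
	 * "currently held locks -> mmap_sem" dependency. Combined with
	 * the fault path's real "mmap_sem -> bo reservation" chain,
	 * holding a reservation across copy_from_user() is reported as
	 * an inversion even if no fault ever happens.
	 */
	might_lock_read(&current->mm->mmap_sem);
}
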
>> My point was that when we only have copy_[to|from]_user_inatomic inside any bo:reservations, might_fault would never be called inside any reservations, and we should, in principle, be free to choose the locking order, provided of course that it can be done cleanly in fault()?
> I think it makes sense to keep mmap_sem the outer lock here, and not do scary things in fault(). Especially if the mm locking is going to be changed in the future. But you're correct: if reservations are held, only the inatomic variants should be used.
>
> ~Maarten

Now that we were, after all, forced to enter the dark realm of copy-user
slowpaths, I don't really have a strong opinion anymore, other than that
we should try to avoid building in too much code that depends on either
locking order, so that we can change it if _really_ needed.

/Thomas
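
For reference, the inversion that the patch below removes, in schematic form (an illustrative sketch using the usual pagefault_disable() pattern, not an excerpt from the patch):

#include <linux/uaccess.h>

/* Before: a faulting user copy under the reservation creates a
 * "reservation -> mmap_sem" ordering, while the fault handler takes
 * "mmap_sem -> reservation" -- a classic inversion. */
static int copy_under_reservation_broken(void *dst, const void __user *src,
					 size_t len)
{
	/* bo_reserve() (a ww_mutex) is held by the caller */
	if (copy_from_user(dst, src, len))	/* may fault -> mmap_sem */
		return -EFAULT;
	return 0;
}

/* After: only a non-faulting copy is attempted while reserved; on
 * failure the caller must drop all reservations, copy with faults
 * allowed, and restart validation (the "restart" label in the patch). */
static int copy_under_reservation_fixed(void *dst, const void __user *src,
					size_t len)
{
	unsigned long left;

	pagefault_disable();
	left = __copy_from_user_inatomic(dst, src, len);
	pagefault_enable();

	return left ? -EFAULT : 0;	/* -EFAULT: unreserve, retry slowpath */
}
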
diff mbox

Patch

diff --git a/drivers/gpu/drm/nouveau/nouveau_gem.c b/drivers/gpu/drm/nouveau/nouveau_gem.c
index e4d60e7..2964bb7 100644
--- a/drivers/gpu/drm/nouveau/nouveau_gem.c
+++ b/drivers/gpu/drm/nouveau/nouveau_gem.c
@@ -445,8 +445,6 @@  validate_list(struct nouveau_channel *chan, struct nouveau_cli *cli,
 	      uint64_t user_pbbo_ptr)
 {
 	struct nouveau_drm *drm = chan->drm;
-	struct drm_nouveau_gem_pushbuf_bo __user *upbbo =
-				(void __force __user *)(uintptr_t)user_pbbo_ptr;
 	struct nouveau_bo *nvbo;
 	int ret, relocs = 0;
 
@@ -475,7 +473,7 @@  validate_list(struct nouveau_channel *chan, struct nouveau_cli *cli,
 			return ret;
 		}
 
-		if (nv_device(drm->device)->card_type < NV_50) {
+		if (nv_device(drm->device)->card_type < NV_50 && !relocs) {
 			if (nvbo->bo.offset == b->presumed.offset &&
 			    ((nvbo->bo.mem.mem_type == TTM_PL_VRAM &&
 			      b->presumed.domain & NOUVEAU_GEM_DOMAIN_VRAM) ||
@@ -483,53 +481,86 @@  validate_list(struct nouveau_channel *chan, struct nouveau_cli *cli,
 			      b->presumed.domain & NOUVEAU_GEM_DOMAIN_GART)))
 				continue;
 
-			if (nvbo->bo.mem.mem_type == TTM_PL_TT)
-				b->presumed.domain = NOUVEAU_GEM_DOMAIN_GART;
-			else
-				b->presumed.domain = NOUVEAU_GEM_DOMAIN_VRAM;
-			b->presumed.offset = nvbo->bo.offset;
-			b->presumed.valid = 0;
-			relocs++;
-
-			if (DRM_COPY_TO_USER(&upbbo[nvbo->pbbo_index].presumed,
-					     &b->presumed, sizeof(b->presumed)))
-				return -EFAULT;
+			relocs = 1;
 		}
 	}
 
 	return relocs;
 }
 
+static inline void
+u_free(void *addr)
+{
+	if (!is_vmalloc_addr(addr))
+		kfree(addr);
+	else
+		vfree(addr);
+}
+
+static inline void *
+u_memcpya(uint64_t user, unsigned nmemb, unsigned size, unsigned inatomic)
+{
+	void *mem;
+	void __user *userptr = (void __force __user *)(uintptr_t)user;
+
+	size *= nmemb;
+
+	mem = kmalloc(size, GFP_KERNEL | __GFP_NOWARN);
+	if (!mem)
+		mem = vmalloc(size);
+	if (!mem)
+		return ERR_PTR(-ENOMEM);
+
+	if (inatomic && (!access_ok(VERIFY_READ, userptr, size) ||
+	    __copy_from_user_inatomic(mem, userptr, size))) {
+		u_free(mem);
+		return ERR_PTR(-EFAULT);
+	} else if (!inatomic && copy_from_user(mem, userptr, size)) {
+		u_free(mem);
+		return ERR_PTR(-EFAULT);
+	}
+
+	return mem;
+}
+
+static int
+nouveau_gem_pushbuf_reloc_apply(struct nouveau_cli *cli,
+				struct drm_nouveau_gem_pushbuf *req,
+				struct drm_nouveau_gem_pushbuf_bo *bo,
+				struct drm_nouveau_gem_pushbuf_reloc *reloc);
+
 static int
 nouveau_gem_pushbuf_validate(struct nouveau_channel *chan,
 			     struct drm_file *file_priv,
 			     struct drm_nouveau_gem_pushbuf_bo *pbbo,
+			     struct drm_nouveau_gem_pushbuf *req,
 			     uint64_t user_buffers, int nr_buffers,
-			     struct validate_op *op, int *apply_relocs)
+			     struct validate_op *op, int *do_reloc)
 {
 	struct nouveau_cli *cli = nouveau_cli(file_priv);
 	int ret, relocs = 0;
+	struct drm_nouveau_gem_pushbuf_reloc *reloc = NULL;
+
+	if (nr_buffers == 0)
+		return 0;
 
+restart:
 	INIT_LIST_HEAD(&op->vram_list);
 	INIT_LIST_HEAD(&op->gart_list);
 	INIT_LIST_HEAD(&op->both_list);
 
-	if (nr_buffers == 0)
-		return 0;
-
 	ret = validate_init(chan, file_priv, pbbo, nr_buffers, op);
 	if (unlikely(ret)) {
 		if (ret != -ERESTARTSYS)
 			NV_ERROR(cli, "validate_init\n");
-		return ret;
+		goto err;
 	}
 
 	ret = validate_list(chan, cli, &op->vram_list, pbbo, user_buffers);
 	if (unlikely(ret < 0)) {
 		if (ret != -ERESTARTSYS)
 			NV_ERROR(cli, "validate vram_list\n");
-		validate_fini(op, NULL);
-		return ret;
+		goto err_fini;
 	}
 	relocs += ret;
 
@@ -537,8 +568,7 @@  nouveau_gem_pushbuf_validate(struct nouveau_channel *chan,
 	if (unlikely(ret < 0)) {
 		if (ret != -ERESTARTSYS)
 			NV_ERROR(cli, "validate gart_list\n");
-		validate_fini(op, NULL);
-		return ret;
+		goto err_fini;
 	}
 	relocs += ret;
 
@@ -546,58 +576,93 @@  nouveau_gem_pushbuf_validate(struct nouveau_channel *chan,
 	if (unlikely(ret < 0)) {
 		if (ret != -ERESTARTSYS)
 			NV_ERROR(cli, "validate both_list\n");
-		validate_fini(op, NULL);
-		return ret;
+		goto err_fini;
 	}
 	relocs += ret;
+	if (relocs) {
+		if (!reloc) {
+			//reloc = u_memcpya(req->relocs, req->nr_relocs, sizeof(*reloc), 1);
+			reloc = ERR_PTR(-EFAULT); NV_ERROR(cli, "slowpath!\n");
+		}
+		if (IS_ERR(reloc)) {
+			validate_fini(op, NULL);
+
+			if (PTR_ERR(reloc) == -EFAULT)
+				reloc = u_memcpya(req->relocs, req->nr_relocs, sizeof(*reloc), 0);
+
+			if (IS_ERR(reloc))
+				return PTR_ERR(reloc);
+			goto restart;
+		}
+
+		ret = nouveau_gem_pushbuf_reloc_apply(cli, req, pbbo, reloc);
+		if (ret) {
+			NV_ERROR(cli, "reloc apply: %d\n", ret);
+			/* No validate_fini, already called. */
+			return ret;
+		}
+		u_free(reloc);
+		*do_reloc = 1;
+	}
 
-	*apply_relocs = relocs;
 	return 0;
-}
 
-static inline void
-u_free(void *addr)
-{
-	if (!is_vmalloc_addr(addr))
-		kfree(addr);
-	else
-		vfree(addr);
+err_fini:
+	validate_fini(op, NULL);
+err:
+	if (reloc)
+		u_free(reloc);
+	return ret;
 }
 
-static inline void *
-u_memcpya(uint64_t user, unsigned nmemb, unsigned size)
+static int
+nouveau_gem_pushbuf_reloc_copy_to_user(struct drm_nouveau_gem_pushbuf *req,
+				       struct drm_nouveau_gem_pushbuf_bo *bo)
 {
-	void *mem;
-	void __user *userptr = (void __force __user *)(uintptr_t)user;
+	struct drm_nouveau_gem_pushbuf_bo __user *upbbo =
+				 (void __force __user *)(uintptr_t)req->buffers;
+	unsigned i;
 
-	size *= nmemb;
+	for (i = 0; i < req->nr_buffers; ++i) {
+		struct drm_nouveau_gem_pushbuf_bo *b = &bo[i];
 
-	mem = kmalloc(size, GFP_KERNEL | __GFP_NOWARN);
-	if (!mem)
-		mem = vmalloc(size);
-	if (!mem)
-		return ERR_PTR(-ENOMEM);
-
-	if (DRM_COPY_FROM_USER(mem, userptr, size)) {
-		u_free(mem);
-		return ERR_PTR(-EFAULT);
+		if (!b->presumed.valid &&
+		    copy_to_user(&upbbo[i].presumed, &b->presumed, sizeof(b->presumed)))
+			return -EFAULT;
 	}
-
-	return mem;
+	return 0;
 }
 
 static int
 nouveau_gem_pushbuf_reloc_apply(struct nouveau_cli *cli,
 				struct drm_nouveau_gem_pushbuf *req,
-				struct drm_nouveau_gem_pushbuf_bo *bo)
+				struct drm_nouveau_gem_pushbuf_bo *bo,
+				struct drm_nouveau_gem_pushbuf_reloc *reloc)
 {
-	struct drm_nouveau_gem_pushbuf_reloc *reloc = NULL;
 	int ret = 0;
 	unsigned i;
 
-	reloc = u_memcpya(req->relocs, req->nr_relocs, sizeof(*reloc));
-	if (IS_ERR(reloc))
-		return PTR_ERR(reloc);
+	for (i = 0; i < req->nr_buffers; ++i) {
+		struct drm_nouveau_gem_pushbuf_bo *b = &bo[i];
+		struct nouveau_bo *nvbo = (void *)(unsigned long)
+			bo[i].user_priv;
+
+		if (nvbo->bo.offset == b->presumed.offset &&
+		    ((nvbo->bo.mem.mem_type == TTM_PL_VRAM &&
+		      b->presumed.domain & NOUVEAU_GEM_DOMAIN_VRAM) ||
+		     (nvbo->bo.mem.mem_type == TTM_PL_TT &&
+		      b->presumed.domain & NOUVEAU_GEM_DOMAIN_GART))) {
+			b->presumed.valid = 1;
+			continue;
+		}
+
+		if (nvbo->bo.mem.mem_type == TTM_PL_TT)
+			b->presumed.domain = NOUVEAU_GEM_DOMAIN_GART;
+		else
+			b->presumed.domain = NOUVEAU_GEM_DOMAIN_VRAM;
+		b->presumed.offset = nvbo->bo.offset;
+		b->presumed.valid = 0;
+	}
 
 	for (i = 0; i < req->nr_relocs; i++) {
 		struct drm_nouveau_gem_pushbuf_reloc *r = &reloc[i];
@@ -664,8 +729,6 @@  nouveau_gem_pushbuf_reloc_apply(struct nouveau_cli *cli,
 
 		nouveau_bo_wr32(nvbo, r->reloc_bo_offset >> 2, data);
 	}
-
-	u_free(reloc);
 	return ret;
 }
 
@@ -721,11 +784,11 @@  nouveau_gem_ioctl_pushbuf(struct drm_device *dev, void *data,
 		return nouveau_abi16_put(abi16, -EINVAL);
 	}
 
-	push = u_memcpya(req->push, req->nr_push, sizeof(*push));
+	push = u_memcpya(req->push, req->nr_push, sizeof(*push), 0);
 	if (IS_ERR(push))
 		return nouveau_abi16_put(abi16, PTR_ERR(push));
 
-	bo = u_memcpya(req->buffers, req->nr_buffers, sizeof(*bo));
+	bo = u_memcpya(req->buffers, req->nr_buffers, sizeof(*bo), 0);
 	if (IS_ERR(bo)) {
 		u_free(push);
 		return nouveau_abi16_put(abi16, PTR_ERR(bo));
@@ -741,7 +804,7 @@  nouveau_gem_ioctl_pushbuf(struct drm_device *dev, void *data,
 	}
 
 	/* Validate buffer list */
-	ret = nouveau_gem_pushbuf_validate(chan, file_priv, bo, req->buffers,
+	ret = nouveau_gem_pushbuf_validate(chan, file_priv, bo, req, req->buffers,
 					   req->nr_buffers, &op, &do_reloc);
 	if (ret) {
 		if (ret != -ERESTARTSYS)
@@ -749,15 +812,6 @@  nouveau_gem_ioctl_pushbuf(struct drm_device *dev, void *data,
 		goto out_prevalid;
 	}
 
-	/* Apply any relocations that are required */
-	if (do_reloc) {
-		ret = nouveau_gem_pushbuf_reloc_apply(cli, req, bo);
-		if (ret) {
-			NV_ERROR(cli, "reloc apply: %d\n", ret);
-			goto out;
-		}
-	}
-
 	if (chan->dma.ib_max) {
 		ret = nouveau_dma_wait(chan, req->nr_push + 1, 16);
 		if (ret) {
@@ -837,6 +891,17 @@  out:
 	validate_fini(&op, fence);
 	nouveau_fence_unref(&fence);
 
+	if (do_reloc && !ret) {
+		ret = nouveau_gem_pushbuf_reloc_copy_to_user(req, bo);
+		if (ret) {
+			NV_ERROR(cli, "error copying presumed back to userspace: %d\n", ret);
+			/*
+			 * XXX: We return -EFAULT, but command submission
+			 * has already been completed.
+			 */
+		}
+	}
+
 out_prevalid:
 	u_free(bo);
 	u_free(push);
diff --git a/drivers/gpu/drm/ttm/ttm_bo_vm.c b/drivers/gpu/drm/ttm/ttm_bo_vm.c
index 1006c15..829e911 100644
--- a/drivers/gpu/drm/ttm/ttm_bo_vm.c
+++ b/drivers/gpu/drm/ttm/ttm_bo_vm.c
@@ -64,12 +64,9 @@  static int ttm_bo_vm_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
 	 * for reserve, and if it fails, retry the fault after scheduling.
 	 */
 
-	ret = ttm_bo_reserve(bo, true, true, false, 0);
-	if (unlikely(ret != 0)) {
-		if (ret == -EBUSY)
-			set_need_resched();
+	ret = ttm_bo_reserve(bo, true, false, false, 0);
+	if (unlikely(ret != 0))
 		return VM_FAULT_NOPAGE;
-	}
 
 	if (bdev->driver->fault_reserve_notify) {
 		ret = bdev->driver->fault_reserve_notify(bo);
@@ -77,6 +74,7 @@  static int ttm_bo_vm_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
 		case 0:
 			break;
 		case -EBUSY:
+			WARN_ON(1);
 			set_need_resched();
 		case -ERESTARTSYS:
 			retval = VM_FAULT_NOPAGE;