diff mbox

[1/5] dma-buf: add optional invalidate_mappings callback v2

Message ID 20180316132049.1748-2-christian.koenig@amd.com (mailing list archive)
State New, archived
Headers show

Commit Message

Christian König March 16, 2018, 1:20 p.m. UTC
Each importer can now provide an invalidate_mappings callback.

This allows the exporter to provide the mappings without the need to pin
the backing store.

v2: don't try to invalidate mappings when the callback is NULL,
    lock the reservation obj while using the attachments,
    add helper to set the callback

Signed-off-by: Christian König <christian.koenig@amd.com>
---
 drivers/dma-buf/dma-buf.c | 60 +++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/dma-buf.h   | 38 ++++++++++++++++++++++++++++++
 2 files changed, 98 insertions(+)

Comments

Chris Wilson March 16, 2018, 1:51 p.m. UTC | #1
Quoting Christian König (2018-03-16 13:20:45)
> @@ -326,6 +338,29 @@ struct dma_buf_attachment {
>         struct device *dev;
>         struct list_head node;
>         void *priv;
> +
> +       /**
> +        * @invalidate_mappings:
> +        *
> +        * Optional callback provided by the importer of the attachment which
> +        * must be set before mappings are created.
> +        *
> +        * If provided the exporter can avoid pinning the backing store while
> +        * mappings exists.

Hmm, no I don't think it avoids the pinning issue entirely. As it stands,
the importer doesn't have a page refcount and so they all rely on the
exporter keeping the dmabuf pages pinned while attached. What can happen
is that given the invalidate cb, the importers can revoke their
attachments, letting the exporter recover the pages/sg, and then start
again from scratch.

That also neatly answers what happens if not all importers provide an
invalidate cb, or fail, the dmabuf remains pinned and the exporter must
retreat.

> +        *
> +        * The function is called with the lock of the reservation object
> +        * associated with the dma_buf held and the mapping function must be
> +        * called with this lock held as well. This makes sure that no mapping
> +        * is created concurrently with an ongoing invalidation.
> +        *
> +        * After the callback all existing mappings are still valid until all
> +        * fences in the dma_bufs reservation object are signaled, but should be
> +        * destroyed by the importer as soon as possible.
> +        *
> +        * New mappings can be created immediately, but can't be used before the
> +        * exclusive fence in the dma_bufs reservation object is signaled.
> +        */
> +       void (*invalidate_mappings)(struct dma_buf_attachment *attach);

The intent is that the invalidate is synchronous and immediate, while
locked? We are looking at recursing back into the dma_buf functions to
remove each attachment from the invalidate cb (as well as waiting for
dma), won't that cause some nasty issues?
-Chris
Christian König March 16, 2018, 2:22 p.m. UTC | #2
Am 16.03.2018 um 14:51 schrieb Chris Wilson:
> Quoting Christian König (2018-03-16 13:20:45)
>> @@ -326,6 +338,29 @@ struct dma_buf_attachment {
>>          struct device *dev;
>>          struct list_head node;
>>          void *priv;
>> +
>> +       /**
>> +        * @invalidate_mappings:
>> +        *
>> +        * Optional callback provided by the importer of the attachment which
>> +        * must be set before mappings are created.
>> +        *
>> +        * If provided the exporter can avoid pinning the backing store while
>> +        * mappings exists.
> Hmm, no I don't think it avoids the pinning issue entirely. As it stands,
> the importer doesn't have a page refcount and so they all rely on the
> exporter keeping the dmabuf pages pinned while attached. What can happen
> is that given the invalidate cb, the importers can revoke their
> attachments, letting the exporter recover the pages/sg, and then start
> again from scratch.

Yes, exactly. The wording is just not 100% precise and I haven't found 
something better so far.

> That also neatly answers what happens if not all importers provide an
> invalidate cb, or fail, the dmabuf remains pinned and the exporter must
> retreat.

Yes, exactly as well. As soon as at least one importer says "I can't do 
this", we must fallback to the old behavior.

>> +        *
>> +        * The function is called with the lock of the reservation object
>> +        * associated with the dma_buf held and the mapping function must be
>> +        * called with this lock held as well. This makes sure that no mapping
>> +        * is created concurrently with an ongoing invalidation.
>> +        *
>> +        * After the callback all existing mappings are still valid until all
>> +        * fences in the dma_bufs reservation object are signaled, but should be
>> +        * destroyed by the importer as soon as possible.
>> +        *
>> +        * New mappings can be created immediately, but can't be used before the
>> +        * exclusive fence in the dma_bufs reservation object is signaled.
>> +        */
>> +       void (*invalidate_mappings)(struct dma_buf_attachment *attach);
> The intent is that the invalidate is synchronous and immediate, while
> locked? We are looking at recursing back into the dma_buf functions to
> remove each attachment from the invalidate cb (as well as waiting for
> dma), won't that cause some nasty issues?

No, with this idea invalidation is asynchronous. Already discussed that 
with Daniel as well and YES Daniel also already pointed out that I need 
to better document this.

When the exporter calls invalidate_mappings() it just means that all 
importers can no longer use their sg tables for new submissions, old 
ones stay active.

The existing sg tables are guaranteed to stay valid until all fences in 
the reservation object have signaled and the importer should also delay 
it's call to call dma_buf_unmap_attachment() until all the fences have 
signaled.

When the importer has new work to do, e.g. wants to attach a new fence 
to the reservation object, it must grab a new sg table for that. The 
importer also needs to make sure that all new work touching the dma-buf 
doesn't start before the exclusive fence in the reservation object signals.

This allows for full grown pipelining, e.g. the exporter can say I need 
to move the buffer for some operation. Then let the move operation wait 
for all existing fences in the reservation object and install the fence 
of the move operation as exclusive fence.

The importer can then immediately grab a new sg table for the new 
location of the buffer and use it to prepare the next operation.

Regards,
Christian.

> -Chris
> _______________________________________________
> dri-devel mailing list
> dri-devel@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/dri-devel
Daniel Vetter March 19, 2018, 2:04 p.m. UTC | #3
On Fri, Mar 16, 2018 at 02:20:45PM +0100, Christian König wrote:
> Each importer can now provide an invalidate_mappings callback.
> 
> This allows the exporter to provide the mappings without the need to pin
> the backing store.
> 
> v2: don't try to invalidate mappings when the callback is NULL,
>     lock the reservation obj while using the attachments,
>     add helper to set the callback
> 
> Signed-off-by: Christian König <christian.koenig@amd.com>

Replying here to avoid thread split, but not entirely related.

I thought some more about the lockdep splat discussion, and specifically
that amdgpu needs the reservations for the vm objects when doing a gpu
reset.

Since they're in the same ww_class as all other dma-buf reservations I'm
pretty sure lockdep will complain, at least when cross-release lockdep and
cross-release annotations for dma_fence are merged.

And as long as there's some case where amdgpu needs both the vm object
reservation and other reservations (CS?) then we must have them in the
same class, and in that case the deadlock is real. It'll require an
impressive set of circumstances most likely (the minimal number of threads
we generally needed to actually hit the cross-release stuff was 4+ or
something nuts like that, all doing something else), but it'll be real.

Otoh I think the invalidate stuff here doesn't actually make this worse,
so we can bang our heads against the gpu reset problem at leisure :-)

This stuff here has much more potential to interact badly with core mm
paths, and anything related to that (which ime also means all the cpu
hotplug machinery, which includes suspend/resume, and anything related to
the vfs because someone always manages to drag sysfs into the picture).

It's going to be fun times.

Cheers, Daniel
> ---
>  drivers/dma-buf/dma-buf.c | 60 +++++++++++++++++++++++++++++++++++++++++++++++
>  include/linux/dma-buf.h   | 38 ++++++++++++++++++++++++++++++
>  2 files changed, 98 insertions(+)
> 
> diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
> index d78d5fc173dc..ed2b3559ba25 100644
> --- a/drivers/dma-buf/dma-buf.c
> +++ b/drivers/dma-buf/dma-buf.c
> @@ -572,7 +572,9 @@ struct dma_buf_attachment *dma_buf_attach(struct dma_buf *dmabuf,
>  		if (ret)
>  			goto err_attach;
>  	}
> +	reservation_object_lock(dmabuf->resv, NULL);
>  	list_add(&attach->node, &dmabuf->attachments);
> +	reservation_object_unlock(dmabuf->resv);
>  
>  	mutex_unlock(&dmabuf->lock);
>  	return attach;
> @@ -598,7 +600,9 @@ void dma_buf_detach(struct dma_buf *dmabuf, struct dma_buf_attachment *attach)
>  		return;
>  
>  	mutex_lock(&dmabuf->lock);
> +	reservation_object_lock(dmabuf->resv, NULL);
>  	list_del(&attach->node);
> +	reservation_object_unlock(dmabuf->resv);
>  	if (dmabuf->ops->detach)
>  		dmabuf->ops->detach(dmabuf, attach);
>  
> @@ -632,10 +636,23 @@ struct sg_table *dma_buf_map_attachment(struct dma_buf_attachment *attach,
>  	if (WARN_ON(!attach || !attach->dmabuf))
>  		return ERR_PTR(-EINVAL);
>  
> +	/*
> +	 * Mapping a DMA-buf can trigger its invalidation, prevent sending this
> +	 * event to the caller by temporary removing this attachment from the
> +	 * list.
> +	 */
> +	if (attach->invalidate_mappings) {
> +		reservation_object_assert_held(attach->dmabuf->resv);
> +		list_del(&attach->node);
> +	}
> +
>  	sg_table = attach->dmabuf->ops->map_dma_buf(attach, direction);
>  	if (!sg_table)
>  		sg_table = ERR_PTR(-ENOMEM);
>  
> +	if (attach->invalidate_mappings)
> +		list_add(&attach->node, &attach->dmabuf->attachments);
> +
>  	return sg_table;
>  }
>  EXPORT_SYMBOL_GPL(dma_buf_map_attachment);
> @@ -656,6 +673,9 @@ void dma_buf_unmap_attachment(struct dma_buf_attachment *attach,
>  {
>  	might_sleep();
>  
> +	if (attach->invalidate_mappings)
> +		reservation_object_assert_held(attach->dmabuf->resv);
> +
>  	if (WARN_ON(!attach || !attach->dmabuf || !sg_table))
>  		return;
>  
> @@ -664,6 +684,44 @@ void dma_buf_unmap_attachment(struct dma_buf_attachment *attach,
>  }
>  EXPORT_SYMBOL_GPL(dma_buf_unmap_attachment);
>  
> +/**
> + * dma_buf_set_invalidate_callback - set the invalidate_mappings callback
> + *
> + * @attach:	[in]	attachment where to set the callback
> + * @cb:		[in]	the callback to set
> + *
> + * Makes sure to take the appropriate locks when updating the invalidate
> + * mappings callback.
> + */
> +void dma_buf_set_invalidate_callback(struct dma_buf_attachment *attach,
> +				     void (*cb)(struct dma_buf_attachment *))
> +{
> +	reservation_object_lock(attach->dmabuf->resv, NULL);
> +	attach->invalidate_mappings = cb;
> +	reservation_object_unlock(attach->dmabuf->resv);
> +}
> +EXPORT_SYMBOL_GPL(dma_buf_set_invalidate_callback);
> +
> +/**
> + * dma_buf_invalidate_mappings - invalidate all mappings of this dma_buf
> + *
> + * @dmabuf:	[in]	buffer which mappings should be invalidated
> + *
> + * Informs all attachmenst that they need to destroy and recreated all their
> + * mappings.
> + */
> +void dma_buf_invalidate_mappings(struct dma_buf *dmabuf)
> +{
> +	struct dma_buf_attachment *attach;
> +
> +	reservation_object_assert_held(dmabuf->resv);
> +
> +	list_for_each_entry(attach, &dmabuf->attachments, node)
> +		if (attach->invalidate_mappings)
> +			attach->invalidate_mappings(attach);
> +}
> +EXPORT_SYMBOL_GPL(dma_buf_invalidate_mappings);
> +
>  /**
>   * DOC: cpu access
>   *
> @@ -1121,10 +1179,12 @@ static int dma_buf_debug_show(struct seq_file *s, void *unused)
>  		seq_puts(s, "\tAttached Devices:\n");
>  		attach_count = 0;
>  
> +		reservation_object_lock(buf_obj->resv, NULL);
>  		list_for_each_entry(attach_obj, &buf_obj->attachments, node) {
>  			seq_printf(s, "\t%s\n", dev_name(attach_obj->dev));
>  			attach_count++;
>  		}
> +		reservation_object_unlock(buf_obj->resv);
>  
>  		seq_printf(s, "Total %d devices attached\n\n",
>  				attach_count);
> diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h
> index 085db2fee2d7..70c65fcfe1e3 100644
> --- a/include/linux/dma-buf.h
> +++ b/include/linux/dma-buf.h
> @@ -91,6 +91,18 @@ struct dma_buf_ops {
>  	 */
>  	void (*detach)(struct dma_buf *, struct dma_buf_attachment *);
>  
> +	/**
> +	 * @supports_mapping_invalidation:
> +	 *
> +	 * True for exporters which supports unpinned DMA-buf operation using
> +	 * the reservation lock.
> +	 *
> +	 * When attachment->invalidate_mappings is set the @map_dma_buf and
> +	 * @unmap_dma_buf callbacks can be called with the reservation lock
> +	 * held.
> +	 */
> +	bool supports_mapping_invalidation;
> +
>  	/**
>  	 * @map_dma_buf:
>  	 *
> @@ -326,6 +338,29 @@ struct dma_buf_attachment {
>  	struct device *dev;
>  	struct list_head node;
>  	void *priv;
> +
> +	/**
> +	 * @invalidate_mappings:
> +	 *
> +	 * Optional callback provided by the importer of the attachment which
> +	 * must be set before mappings are created.
> +	 *
> +	 * If provided the exporter can avoid pinning the backing store while
> +	 * mappings exists.
> +	 *
> +	 * The function is called with the lock of the reservation object
> +	 * associated with the dma_buf held and the mapping function must be
> +	 * called with this lock held as well. This makes sure that no mapping
> +	 * is created concurrently with an ongoing invalidation.
> +	 *
> +	 * After the callback all existing mappings are still valid until all
> +	 * fences in the dma_bufs reservation object are signaled, but should be
> +	 * destroyed by the importer as soon as possible.
> +	 *
> +	 * New mappings can be created immediately, but can't be used before the
> +	 * exclusive fence in the dma_bufs reservation object is signaled.
> +	 */
> +	void (*invalidate_mappings)(struct dma_buf_attachment *attach);
>  };
>  
>  /**
> @@ -391,6 +426,9 @@ struct sg_table *dma_buf_map_attachment(struct dma_buf_attachment *,
>  					enum dma_data_direction);
>  void dma_buf_unmap_attachment(struct dma_buf_attachment *, struct sg_table *,
>  				enum dma_data_direction);
> +void dma_buf_set_invalidate_callback(struct dma_buf_attachment *attach,
> +				     void (*cb)(struct dma_buf_attachment *));
> +void dma_buf_invalidate_mappings(struct dma_buf *dma_buf);
>  int dma_buf_begin_cpu_access(struct dma_buf *dma_buf,
>  			     enum dma_data_direction dir);
>  int dma_buf_end_cpu_access(struct dma_buf *dma_buf,
> -- 
> 2.14.1
> 
> _______________________________________________
> dri-devel mailing list
> dri-devel@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/dri-devel
Chris Wilson March 19, 2018, 3:53 p.m. UTC | #4
Quoting Christian König (2018-03-16 14:22:32)
[snip, probably lost too must context]
> This allows for full grown pipelining, e.g. the exporter can say I need 
> to move the buffer for some operation. Then let the move operation wait 
> for all existing fences in the reservation object and install the fence 
> of the move operation as exclusive fence.

Ok, the situation I have in mind is the non-pipelined case: revoking
dma-buf for mmu_invalidate_range or shrink_slab. I would need a
completion event that can be waited on the cpu for all the invalidate
callbacks. (Essentially an atomic_t counter plus struct completion; a
lighter version of dma_fence, I wonder where I've seen that before ;)

Even so, it basically means passing a fence object down to the async
callbacks for them to signal when they are complete. Just to handle the
non-pipelined version. :|
-Chris
Christian König March 19, 2018, 4:23 p.m. UTC | #5
Am 19.03.2018 um 16:53 schrieb Chris Wilson:
> Quoting Christian König (2018-03-16 14:22:32)
> [snip, probably lost too must context]
>> This allows for full grown pipelining, e.g. the exporter can say I need
>> to move the buffer for some operation. Then let the move operation wait
>> for all existing fences in the reservation object and install the fence
>> of the move operation as exclusive fence.
> Ok, the situation I have in mind is the non-pipelined case: revoking
> dma-buf for mmu_invalidate_range or shrink_slab. I would need a
> completion event that can be waited on the cpu for all the invalidate
> callbacks. (Essentially an atomic_t counter plus struct completion; a
> lighter version of dma_fence, I wonder where I've seen that before ;)

Actually that is harmless.

When you need to unmap a DMA-buf because of mmu_invalidate_range or 
shrink_slab you need to wait for it's reservation object anyway.

This needs to be done to make sure that the backing memory is now idle, 
it doesn't matter if the jobs where submitted by DMA-buf importers or 
your own driver.

The sg tables pointing to the now released memory might live a bit 
longer, but that is unproblematic and actually intended.

When we would try to destroy the sg tables in an mmu_invalidate_range or 
shrink_slab callback we would run into a lockdep horror.

Regards,
Christian.

>
> Even so, it basically means passing a fence object down to the async
> callbacks for them to signal when they are complete. Just to handle the
> non-pipelined version. :|
> -Chris
> _______________________________________________
> dri-devel mailing list
> dri-devel@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/dri-devel
Daniel Vetter March 20, 2018, 7:44 a.m. UTC | #6
On Mon, Mar 19, 2018 at 5:23 PM, Christian König
<ckoenig.leichtzumerken@gmail.com> wrote:
> Am 19.03.2018 um 16:53 schrieb Chris Wilson:
>>
>> Quoting Christian König (2018-03-16 14:22:32)
>> [snip, probably lost too must context]
>>>
>>> This allows for full grown pipelining, e.g. the exporter can say I need
>>> to move the buffer for some operation. Then let the move operation wait
>>> for all existing fences in the reservation object and install the fence
>>> of the move operation as exclusive fence.
>>
>> Ok, the situation I have in mind is the non-pipelined case: revoking
>> dma-buf for mmu_invalidate_range or shrink_slab. I would need a
>> completion event that can be waited on the cpu for all the invalidate
>> callbacks. (Essentially an atomic_t counter plus struct completion; a
>> lighter version of dma_fence, I wonder where I've seen that before ;)
>
>
> Actually that is harmless.
>
> When you need to unmap a DMA-buf because of mmu_invalidate_range or
> shrink_slab you need to wait for it's reservation object anyway.

reservation_object only prevents adding new fences, you still have to
wait for all the current ones to signal. Also, we have dma-access
without fences in i915. "I hold the reservation_object" does not imply
you can just go and nuke the backing storage.

> This needs to be done to make sure that the backing memory is now idle, it
> doesn't matter if the jobs where submitted by DMA-buf importers or your own
> driver.
>
> The sg tables pointing to the now released memory might live a bit longer,
> but that is unproblematic and actually intended.

I think that's very problematic. One reason for an IOMMU is that you
have device access isolation, and a broken device can't access memory
it shouldn't be able to access. From that security-in-depth point of
view it's not cool that there's some sg tables hanging around still
that a broken GPU could use. And let's not pretend hw is perfect,
especially GPUs :-)

> When we would try to destroy the sg tables in an mmu_invalidate_range or
> shrink_slab callback we would run into a lockdep horror.

So I'm no expert on this, but I think this is exactly what we're doing
in i915. Kinda no other way to actually free the memory without
throwing all the nice isolation aspects of an IOMMU into the wind. Can
you please paste the lockdeps you've seen with amdgpu when trying to
do that?
-Daniel

>
> Regards,
> Christian.
>
>>
>> Even so, it basically means passing a fence object down to the async
>> callbacks for them to signal when they are complete. Just to handle the
>> non-pipelined version. :|
>> -Chris
>> _______________________________________________
>> dri-devel mailing list
>> dri-devel@lists.freedesktop.org
>> https://lists.freedesktop.org/mailman/listinfo/dri-devel
>
>
> _______________________________________________
> Linaro-mm-sig mailing list
> Linaro-mm-sig@lists.linaro.org
> https://lists.linaro.org/mailman/listinfo/linaro-mm-sig
Christian König March 20, 2018, 10:54 a.m. UTC | #7
Am 20.03.2018 um 08:44 schrieb Daniel Vetter:
> On Mon, Mar 19, 2018 at 5:23 PM, Christian König
> <ckoenig.leichtzumerken@gmail.com> wrote:
>> Am 19.03.2018 um 16:53 schrieb Chris Wilson:
>>> Quoting Christian König (2018-03-16 14:22:32)
>>> [snip, probably lost too must context]
>>>> This allows for full grown pipelining, e.g. the exporter can say I need
>>>> to move the buffer for some operation. Then let the move operation wait
>>>> for all existing fences in the reservation object and install the fence
>>>> of the move operation as exclusive fence.
>>> Ok, the situation I have in mind is the non-pipelined case: revoking
>>> dma-buf for mmu_invalidate_range or shrink_slab. I would need a
>>> completion event that can be waited on the cpu for all the invalidate
>>> callbacks. (Essentially an atomic_t counter plus struct completion; a
>>> lighter version of dma_fence, I wonder where I've seen that before ;)
>>
>> Actually that is harmless.
>>
>> When you need to unmap a DMA-buf because of mmu_invalidate_range or
>> shrink_slab you need to wait for it's reservation object anyway.
> reservation_object only prevents adding new fences, you still have to
> wait for all the current ones to signal. Also, we have dma-access
> without fences in i915. "I hold the reservation_object" does not imply
> you can just go and nuke the backing storage.

I was not talking about taking the lock, but rather using 
reservation_object_wait_timeout_rcu().

To be more precise you actually can't take the reservation object lock 
in an mmu_invalidate_range callback and you can only trylock it in a 
shrink_slab callback.

>> This needs to be done to make sure that the backing memory is now idle, it
>> doesn't matter if the jobs where submitted by DMA-buf importers or your own
>> driver.
>>
>> The sg tables pointing to the now released memory might live a bit longer,
>> but that is unproblematic and actually intended.
> I think that's very problematic. One reason for an IOMMU is that you
> have device access isolation, and a broken device can't access memory
> it shouldn't be able to access. From that security-in-depth point of
> view it's not cool that there's some sg tables hanging around still
> that a broken GPU could use. And let's not pretend hw is perfect,
> especially GPUs :-)

I completely agree on that, but there is unfortunately no other way.

See you simply can't take a reservation object lock in an mmu or slab 
callback, you can only trylock them.

For example it would require changing all allocations done while holding 
any reservation lock to GFP_NOIO.

>> When we would try to destroy the sg tables in an mmu_invalidate_range or
>> shrink_slab callback we would run into a lockdep horror.
> So I'm no expert on this, but I think this is exactly what we're doing
> in i915. Kinda no other way to actually free the memory without
> throwing all the nice isolation aspects of an IOMMU into the wind. Can
> you please paste the lockdeps you've seen with amdgpu when trying to
> do that?

Taking a quick look at i915 I can definitely say that this is actually 
quite buggy what you guys do here.

For coherent usage you need to install some lock to prevent concurrent 
get_user_pages(), command submission and 
invalidate_range_start/invalidate_range_end from the MMU notifier.

Otherwise you can't guarantee that you are actually accessing the right 
page in the case of a fork() or mprotect().

Felix and I hammered for quite some time on amdgpu until all of this was 
handled correctly, see drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c.

I can try to gather the lockdep splat from my mail history, but it 
essentially took us multiple years to get rid of all of them.

Regards,
Christian.

> -Daniel
>
>> Regards,
>> Christian.
>>
>>> Even so, it basically means passing a fence object down to the async
>>> callbacks for them to signal when they are complete. Just to handle the
>>> non-pipelined version. :|
>>> -Chris
>>> _______________________________________________
>>> dri-devel mailing list
>>> dri-devel@lists.freedesktop.org
>>> https://lists.freedesktop.org/mailman/listinfo/dri-devel
>>
>> _______________________________________________
>> Linaro-mm-sig mailing list
>> Linaro-mm-sig@lists.linaro.org
>> https://lists.linaro.org/mailman/listinfo/linaro-mm-sig
>
>
Daniel Vetter March 20, 2018, 2:08 p.m. UTC | #8
On Tue, Mar 20, 2018 at 11:54:18AM +0100, Christian König wrote:
> Am 20.03.2018 um 08:44 schrieb Daniel Vetter:
> > On Mon, Mar 19, 2018 at 5:23 PM, Christian König
> > <ckoenig.leichtzumerken@gmail.com> wrote:
> > > Am 19.03.2018 um 16:53 schrieb Chris Wilson:
> > > > Quoting Christian König (2018-03-16 14:22:32)
> > > > [snip, probably lost too must context]
> > > > > This allows for full grown pipelining, e.g. the exporter can say I need
> > > > > to move the buffer for some operation. Then let the move operation wait
> > > > > for all existing fences in the reservation object and install the fence
> > > > > of the move operation as exclusive fence.
> > > > Ok, the situation I have in mind is the non-pipelined case: revoking
> > > > dma-buf for mmu_invalidate_range or shrink_slab. I would need a
> > > > completion event that can be waited on the cpu for all the invalidate
> > > > callbacks. (Essentially an atomic_t counter plus struct completion; a
> > > > lighter version of dma_fence, I wonder where I've seen that before ;)
> > > 
> > > Actually that is harmless.
> > > 
> > > When you need to unmap a DMA-buf because of mmu_invalidate_range or
> > > shrink_slab you need to wait for it's reservation object anyway.
> > reservation_object only prevents adding new fences, you still have to
> > wait for all the current ones to signal. Also, we have dma-access
> > without fences in i915. "I hold the reservation_object" does not imply
> > you can just go and nuke the backing storage.
> 
> I was not talking about taking the lock, but rather using
> reservation_object_wait_timeout_rcu().
> 
> To be more precise you actually can't take the reservation object lock in an
> mmu_invalidate_range callback and you can only trylock it in a shrink_slab
> callback.

Ah ok, and yes agreed. Kinda. See below.

> > > This needs to be done to make sure that the backing memory is now idle, it
> > > doesn't matter if the jobs where submitted by DMA-buf importers or your own
> > > driver.
> > > 
> > > The sg tables pointing to the now released memory might live a bit longer,
> > > but that is unproblematic and actually intended.
> > I think that's very problematic. One reason for an IOMMU is that you
> > have device access isolation, and a broken device can't access memory
> > it shouldn't be able to access. From that security-in-depth point of
> > view it's not cool that there's some sg tables hanging around still
> > that a broken GPU could use. And let's not pretend hw is perfect,
> > especially GPUs :-)
> 
> I completely agree on that, but there is unfortunately no other way.
> 
> See you simply can't take a reservation object lock in an mmu or slab
> callback, you can only trylock them.
> 
> For example it would require changing all allocations done while holding any
> reservation lock to GFP_NOIO.

Yeah mmu and slab can only trylock, and they need to skip to the next
object when the trylock fails. But once you have the lock you imo should
be able to clean up the entire mess still. We definitely do that for the
i915 shrinkers, and I don't see how going to importers through the
->invalidate_mapping callback changes anything with that.

For the in-driver reservation path (CS) having a slow-path that grabs a
temporary reference, drops the vram lock and then locks the reservation
normally (using the acquire context used already for the entire CS) is a
bit tricky, but totally feasible. Ttm doesn't do that though.

So there's completely feasible ways to make sure the sg list is all
properly released, all DMA gone and the IOMMU mappings torn down. Anything
else is just a bit shoddy device driver programming imo.

> > > When we would try to destroy the sg tables in an mmu_invalidate_range or
> > > shrink_slab callback we would run into a lockdep horror.
> > So I'm no expert on this, but I think this is exactly what we're doing
> > in i915. Kinda no other way to actually free the memory without
> > throwing all the nice isolation aspects of an IOMMU into the wind. Can
> > you please paste the lockdeps you've seen with amdgpu when trying to
> > do that?
> 
> Taking a quick look at i915 I can definitely say that this is actually quite
> buggy what you guys do here.

Note that there are 2 paths for i915 userptr. One is the mmu notifier, the
other one is the root-only hack we have for dubious reasons (or that I
really don't see the point in myself).

> For coherent usage you need to install some lock to prevent concurrent
> get_user_pages(), command submission and
> invalidate_range_start/invalidate_range_end from the MMU notifier.
> 
> Otherwise you can't guarantee that you are actually accessing the right page
> in the case of a fork() or mprotect().

Yeah doing that with a full lock will create endless amounts of issues,
but I don't see why we need that. Userspace racing stuff with itself gets
to keep all the pieces. This is like racing DIRECT_IO against mprotect and
fork.

Leaking the IOMMU mappings otoh means rogue userspace could do a bunch of
stray writes (I don't see anywhere code in amdgpu_mn.c to unmap at least
the gpu side PTEs to make stuff inaccessible) and wreak the core kernel's
book-keeping.

In i915 we guarantee that we call set_page_dirty/mark_page_accessed only
after all the mappings are really gone (both GPU PTEs and sg mapping),
guaranteeing that any stray writes from either the GPU or IOMMU will
result in faults (except bugs in the IOMMU, but can't have it all, "IOMMU
actually works" is an assumption behind device isolation).

> Felix and I hammered for quite some time on amdgpu until all of this was
> handled correctly, see drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c.

Maybe we should have more shared code in this, it seems to be a source of
endless amounts of fun ...

> I can try to gather the lockdep splat from my mail history, but it
> essentially took us multiple years to get rid of all of them.

I'm very much interested in specifically the splat that makes it
impossible for you folks to remove the sg mappings. That one sounds bad.
And would essentially make mmu_notifiers useless for their primary use
case, which is handling virtual machines where you really have to make
sure the IOMMU mapping is gone before you claim it's gone, because there's
no 2nd level of device checks (like GPU PTEs, or command checker) catching
stray writes.

Cheers, Daniel

> 
> Regards,
> Christian.
> 
> > -Daniel
> > 
> > > Regards,
> > > Christian.
> > > 
> > > > Even so, it basically means passing a fence object down to the async
> > > > callbacks for them to signal when they are complete. Just to handle the
> > > > non-pipelined version. :|
> > > > -Chris
> > > > _______________________________________________
> > > > dri-devel mailing list
> > > > dri-devel@lists.freedesktop.org
> > > > https://lists.freedesktop.org/mailman/listinfo/dri-devel
> > > 
> > > _______________________________________________
> > > Linaro-mm-sig mailing list
> > > Linaro-mm-sig@lists.linaro.org
> > > https://lists.linaro.org/mailman/listinfo/linaro-mm-sig
> > 
> > 
>
Christian König March 20, 2018, 5:47 p.m. UTC | #9
Am 20.03.2018 um 15:08 schrieb Daniel Vetter:
> [SNIP]
> For the in-driver reservation path (CS) having a slow-path that grabs a
> temporary reference, drops the vram lock and then locks the reservation
> normally (using the acquire context used already for the entire CS) is a
> bit tricky, but totally feasible. Ttm doesn't do that though.

That is exactly what we do in amdgpu as well, it's just not very 
efficient nor reliable to retry getting the right pages for a submission 
over and over again.

> [SNIP]
> Note that there are 2 paths for i915 userptr. One is the mmu notifier, the
> other one is the root-only hack we have for dubious reasons (or that I
> really don't see the point in myself).

Well I'm referring to i915_gem_userptr.c, if that isn't what you are 
exposing then just feel free to ignore this whole discussion.

>> For coherent usage you need to install some lock to prevent concurrent
>> get_user_pages(), command submission and
>> invalidate_range_start/invalidate_range_end from the MMU notifier.
>>
>> Otherwise you can't guarantee that you are actually accessing the right page
>> in the case of a fork() or mprotect().
> Yeah doing that with a full lock will create endless amounts of issues,
> but I don't see why we need that. Userspace racing stuff with itself gets
> to keep all the pieces. This is like racing DIRECT_IO against mprotect and
> fork.

First of all I strongly disagree on that. A thread calling fork() 
because it wants to run a command is not something we can forbid just 
because we have a gfx stack loaded. That the video driver is not capable 
of handling that correct is certainly not the problem of userspace.

Second it's not only userspace racing here, you can get into this kind 
of issues just because of transparent huge page support where the 
background daemon tries to reallocate the page tables into bigger chunks.

And if I'm not completely mistaken you can also open up quite a bunch of 
security problems if you suddenly access the wrong page.

> Leaking the IOMMU mappings otoh means rogue userspace could do a bunch of
> stray writes (I don't see anywhere code in amdgpu_mn.c to unmap at least
> the gpu side PTEs to make stuff inaccessible) and wreak the core kernel's
> book-keeping.
>
> In i915 we guarantee that we call set_page_dirty/mark_page_accessed only
> after all the mappings are really gone (both GPU PTEs and sg mapping),
> guaranteeing that any stray writes from either the GPU or IOMMU will
> result in faults (except bugs in the IOMMU, but can't have it all, "IOMMU
> actually works" is an assumption behind device isolation).
Well exactly that's the point, the handling in i915 looks incorrect to 
me. You need to call set_page_dirty/mark_page_accessed way before the 
mapping is destroyed.

To be more precise for userptrs it must be called from the 
invalidate_range_start, but i915 seems to delegate everything into a 
background worker to avoid the locking problems.

>> Felix and I hammered for quite some time on amdgpu until all of this was
>> handled correctly, see drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c.
> Maybe we should have more shared code in this, it seems to be a source of
> endless amounts of fun ...
>
>> I can try to gather the lockdep splat from my mail history, but it
>> essentially took us multiple years to get rid of all of them.
> I'm very much interested in specifically the splat that makes it
> impossible for you folks to remove the sg mappings. That one sounds bad.
> And would essentially make mmu_notifiers useless for their primary use
> case, which is handling virtual machines where you really have to make
> sure the IOMMU mapping is gone before you claim it's gone, because there's
> no 2nd level of device checks (like GPU PTEs, or command checker) catching
> stray writes.

Well to be more precise the problem is not that we can't destroy the sg 
table, but rather that we can't grab the locks to do so.

See when you need to destroy the sg table you usually need to grab the 
same lock you grabbed when you created it.

And all locks taken while in an MMU notifier can only depend on memory 
allocation with GFP_NOIO, which is not really feasible for gfx drivers.

Regards,
Christian.
Daniel Vetter March 21, 2018, 8:18 a.m. UTC | #10
On Tue, Mar 20, 2018 at 06:47:57PM +0100, Christian König wrote:
> Am 20.03.2018 um 15:08 schrieb Daniel Vetter:
> > [SNIP]
> > For the in-driver reservation path (CS) having a slow-path that grabs a
> > temporary reference, drops the vram lock and then locks the reservation
> > normally (using the acquire context used already for the entire CS) is a
> > bit tricky, but totally feasible. Ttm doesn't do that though.
> 
> That is exactly what we do in amdgpu as well, it's just not very efficient
> nor reliable to retry getting the right pages for a submission over and over
> again.
> 
> > [SNIP]
> > Note that there are 2 paths for i915 userptr. One is the mmu notifier, the
> > other one is the root-only hack we have for dubious reasons (or that I
> > really don't see the point in myself).
> 
> Well I'm referring to i915_gem_userptr.c, if that isn't what you are
> exposing then just feel free to ignore this whole discussion.

They're both in i915_gem_userptr.c, somewhat interleaved. Would be
interesting if you could show what you think is going wrong in there
compared to amdgpu_mn.c.

> > > For coherent usage you need to install some lock to prevent concurrent
> > > get_user_pages(), command submission and
> > > invalidate_range_start/invalidate_range_end from the MMU notifier.
> > > 
> > > Otherwise you can't guarantee that you are actually accessing the right page
> > > in the case of a fork() or mprotect().
> > Yeah doing that with a full lock will create endless amounts of issues,
> > but I don't see why we need that. Userspace racing stuff with itself gets
> > to keep all the pieces. This is like racing DIRECT_IO against mprotect and
> > fork.
> 
> First of all I strongly disagree on that. A thread calling fork() because it
> wants to run a command is not something we can forbid just because we have a
> gfx stack loaded. That the video driver is not capable of handling that
> correct is certainly not the problem of userspace.
> 
> Second it's not only userspace racing here, you can get into this kind of
> issues just because of transparent huge page support where the background
> daemon tries to reallocate the page tables into bigger chunks.
> 
> And if I'm not completely mistaken you can also open up quite a bunch of
> security problems if you suddenly access the wrong page.

I get a feeling we're talking past each another here. Can you perhaps
explain what exactly the race is you're seeing? The i915 userptr code is
fairly convoluted and pushes a lot of stuff to workers (but then syncs
with those workers again later on), so easily possible you've overlooked
one of these lines that might guarantee already what you think needs to be
guaranteed. We're definitely not aiming to allow userspace to allow
writing to random pages all over.

> > Leaking the IOMMU mappings otoh means rogue userspace could do a bunch of
> > stray writes (I don't see anywhere code in amdgpu_mn.c to unmap at least
> > the gpu side PTEs to make stuff inaccessible) and wreak the core kernel's
> > book-keeping.
> > 
> > In i915 we guarantee that we call set_page_dirty/mark_page_accessed only
> > after all the mappings are really gone (both GPU PTEs and sg mapping),
> > guaranteeing that any stray writes from either the GPU or IOMMU will
> > result in faults (except bugs in the IOMMU, but can't have it all, "IOMMU
> > actually works" is an assumption behind device isolation).
> Well exactly that's the point, the handling in i915 looks incorrect to me.
> You need to call set_page_dirty/mark_page_accessed way before the mapping is
> destroyed.
> 
> To be more precise for userptrs it must be called from the
> invalidate_range_start, but i915 seems to delegate everything into a
> background worker to avoid the locking problems.

Yeah, and at the end of the function there's a flush_work to make sure the
worker has caught up.

The set_page_dirty is also there, but hidden very deep in the call chain
as part of the vma unmapping and backing storage unpinning. But I do think
we guarantee what you expect needs to happen.

> > > Felix and I hammered for quite some time on amdgpu until all of this was
> > > handled correctly, see drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c.
> > Maybe we should have more shared code in this, it seems to be a source of
> > endless amounts of fun ...
> > 
> > > I can try to gather the lockdep splat from my mail history, but it
> > > essentially took us multiple years to get rid of all of them.
> > I'm very much interested in specifically the splat that makes it
> > impossible for you folks to remove the sg mappings. That one sounds bad.
> > And would essentially make mmu_notifiers useless for their primary use
> > case, which is handling virtual machines where you really have to make
> > sure the IOMMU mapping is gone before you claim it's gone, because there's
> > no 2nd level of device checks (like GPU PTEs, or command checker) catching
> > stray writes.
> 
> Well to be more precise the problem is not that we can't destroy the sg
> table, but rather that we can't grab the locks to do so.
> 
> See when you need to destroy the sg table you usually need to grab the same
> lock you grabbed when you created it.
> 
> And all locks taken while in an MMU notifier can only depend on memory
> allocation with GFP_NOIO, which is not really feasible for gfx drivers.

I know. i915 gem has tons of fallbacks and retry loops (we restart the
entire CS if needed), and i915 userptr pushes the entire get_user_pages
dance off into a worker if the fastpath doesn't succeed and we run out of
memory or hit contended locks. We also have obscene amounts of
__GFP_NORETRY and __GFP_NOWARN all over the place to make sure the core mm
code doesn't do something we don't want it do to do in the fastpaths
(because there's really no point in spending lots of time trying to make
memory available if we have a slowpath fallback with much less
constraints).

We're also not limiting ourselves to GFP_NOIO, but instead have a
recursion detection&handling in our own shrinker callback to avoid these
deadlocks.

It's definitely not easy, and there's lots of horrible code, but it is
possible. Just not releasing the sg mappings while you no longer own the
memory isn't an option imo.
-Daniel
Daniel Vetter March 21, 2018, 8:28 a.m. UTC | #11
On Tue, Mar 20, 2018 at 06:47:57PM +0100, Christian König wrote:
> Am 20.03.2018 um 15:08 schrieb Daniel Vetter:
> > [SNIP]
> > For the in-driver reservation path (CS) having a slow-path that grabs a
> > temporary reference, drops the vram lock and then locks the reservation
> > normally (using the acquire context used already for the entire CS) is a
> > bit tricky, but totally feasible. Ttm doesn't do that though.
> 
> That is exactly what we do in amdgpu as well, it's just not very efficient
> nor reliable to retry getting the right pages for a submission over and over
> again.

Out of curiosity, where's that code? I did read the ttm eviction code way
back, and that one definitely didn't do that. Would be interesting to
update my understanding.
-Daniel
Christian König March 21, 2018, 9:34 a.m. UTC | #12
Am 21.03.2018 um 09:18 schrieb Daniel Vetter:
> [SNIP]
> They're both in i915_gem_userptr.c, somewhat interleaved. Would be
> interesting if you could show what you think is going wrong in there
> compared to amdgpu_mn.c.

i915 implements only one callback:
> static const struct mmu_notifier_ops i915_gem_userptr_notifier = {
>         .invalidate_range_start = 
> i915_gem_userptr_mn_invalidate_range_start,
> };
For correct operation you always need to implement invalidate_range_end 
as well and add some lock/completion work Otherwise get_user_pages() can 
again grab the reference to the wrong page.

The next problem seems to be that cancel_userptr() doesn't prevent any 
new command submission. E.g.
> i915_gem_object_wait(obj, I915_WAIT_ALL, MAX_SCHEDULE_TIMEOUT, NULL);
What prevents new command submissions to use the GEM object directly 
after you finished waiting here?

> I get a feeling we're talking past each another here.
Yeah, agree. Additional to that I don't know the i915 code very well.

> Can you perhaps explain what exactly the race is you're seeing? The i915 userptr code is
> fairly convoluted and pushes a lot of stuff to workers (but then syncs
> with those workers again later on), so easily possible you've overlooked
> one of these lines that might guarantee already what you think needs to be
> guaranteed. We're definitely not aiming to allow userspace to allow
> writing to random pages all over.

You not read/write to random pages, there still is a reference to the 
page. So that the page can't be reused until you are done.

The problem is rather that you can't guarantee that you write to the 
page which is mapped into the process at that location. E.g. the CPU and 
the GPU might see two different things.

>>> Leaking the IOMMU mappings otoh means rogue userspace could do a bunch of
>>> stray writes (I don't see anywhere code in amdgpu_mn.c to unmap at least
>>> the gpu side PTEs to make stuff inaccessible) and wreak the core kernel's
>>> book-keeping.
>>>
>>> In i915 we guarantee that we call set_page_dirty/mark_page_accessed only
>>> after all the mappings are really gone (both GPU PTEs and sg mapping),
>>> guaranteeing that any stray writes from either the GPU or IOMMU will
>>> result in faults (except bugs in the IOMMU, but can't have it all, "IOMMU
>>> actually works" is an assumption behind device isolation).
>> Well exactly that's the point, the handling in i915 looks incorrect to me.
>> You need to call set_page_dirty/mark_page_accessed way before the mapping is
>> destroyed.
>>
>> To be more precise for userptrs it must be called from the
>> invalidate_range_start, but i915 seems to delegate everything into a
>> background worker to avoid the locking problems.
> Yeah, and at the end of the function there's a flush_work to make sure the
> worker has caught up.
Ah, yes haven't seen that.

But then grabbing the obj->base.dev->struct_mutex lock in 
cancel_userptr() is rather evil. You just silenced lockdep because you 
offloaded that into a work item.

So no matter how you put it i915 is clearly doing something wrong here :)

> I know. i915 gem has tons of fallbacks and retry loops (we restart the
> entire CS if needed), and i915 userptr pushes the entire get_user_pages
> dance off into a worker if the fastpath doesn't succeed and we run out of
> memory or hit contended locks. We also have obscene amounts of
> __GFP_NORETRY and __GFP_NOWARN all over the place to make sure the core mm
> code doesn't do something we don't want it do to do in the fastpaths
> (because there's really no point in spending lots of time trying to make
> memory available if we have a slowpath fallback with much less
> constraints).
Well I haven't audited the code, but I'm pretty sure that just mitigates 
the problem and silenced lockdep instead of really fixing the issue.

> We're also not limiting ourselves to GFP_NOIO, but instead have a
> recursion detection&handling in our own shrinker callback to avoid these
> deadlocks.

Which if you ask me is absolutely horrible. I mean the comment in 
linux/mutex.h sums it up pretty well:
>  * This function should not be used, _ever_. It is purely for 
> hysterical GEM
>  * raisins, and once those are gone this will be removed.

Regards,
Christian.
Christian König March 21, 2018, 11:54 a.m. UTC | #13
Am 21.03.2018 um 09:28 schrieb Daniel Vetter:
> On Tue, Mar 20, 2018 at 06:47:57PM +0100, Christian König wrote:
>> Am 20.03.2018 um 15:08 schrieb Daniel Vetter:
>>> [SNIP]
>>> For the in-driver reservation path (CS) having a slow-path that grabs a
>>> temporary reference, drops the vram lock and then locks the reservation
>>> normally (using the acquire context used already for the entire CS) is a
>>> bit tricky, but totally feasible. Ttm doesn't do that though.
>> That is exactly what we do in amdgpu as well, it's just not very efficient
>> nor reliable to retry getting the right pages for a submission over and over
>> again.
> Out of curiosity, where's that code? I did read the ttm eviction code way
> back, and that one definitely didn't do that. Would be interesting to
> update my understanding.

That is in amdgpu_cs.c. amdgpu_cs_parser_bos() does a horrible dance 
with grabbing, releasing and regrabbing locks in a loop.

Then in amdgpu_cs_submit() we grab an lock preventing page table updates 
and check if all pages are still the one we want to have:
>         amdgpu_mn_lock(p->mn);
>         if (p->bo_list) {
>                 for (i = p->bo_list->first_userptr;
>                      i < p->bo_list->num_entries; ++i) {
>                         struct amdgpu_bo *bo = p->bo_list->array[i].robj;
>
>                         if 
> (amdgpu_ttm_tt_userptr_needs_pages(bo->tbo.ttm)) {
>                                 amdgpu_mn_unlock(p->mn);
>                                 return -ERESTARTSYS;
>                         }
>                 }
>         }

If anything changed on the page tables we restart the whole IOCTL using 
-ERESTARTSYS and try again.

Regards,
Christian.

> -Daniel
Daniel Vetter March 22, 2018, 7:14 a.m. UTC | #14
On Wed, Mar 21, 2018 at 10:34:05AM +0100, Christian König wrote:
> Am 21.03.2018 um 09:18 schrieb Daniel Vetter:
> > [SNIP]
> > They're both in i915_gem_userptr.c, somewhat interleaved. Would be
> > interesting if you could show what you think is going wrong in there
> > compared to amdgpu_mn.c.
> 
> i915 implements only one callback:
> > static const struct mmu_notifier_ops i915_gem_userptr_notifier = {
> >         .invalidate_range_start =
> > i915_gem_userptr_mn_invalidate_range_start,
> > };
> For correct operation you always need to implement invalidate_range_end as
> well and add some lock/completion work Otherwise get_user_pages() can again
> grab the reference to the wrong page.

Is this really a problem? I figured that if a mmu_notifier invalidation is
going on, a get_user_pages on that mm from anywhere else (whether i915 or
anyone really) will serialize with the ongoing invalidate? If that's not
the case, then really any get_user_pages is racy, including all the
DIRECT_IO ones.

> The next problem seems to be that cancel_userptr() doesn't prevent any new
> command submission. E.g.
> > i915_gem_object_wait(obj, I915_WAIT_ALL, MAX_SCHEDULE_TIMEOUT, NULL);
> What prevents new command submissions to use the GEM object directly after
> you finished waiting here?
> 
> > I get a feeling we're talking past each another here.
> Yeah, agree. Additional to that I don't know the i915 code very well.
> 
> > Can you perhaps explain what exactly the race is you're seeing? The i915 userptr code is
> > fairly convoluted and pushes a lot of stuff to workers (but then syncs
> > with those workers again later on), so easily possible you've overlooked
> > one of these lines that might guarantee already what you think needs to be
> > guaranteed. We're definitely not aiming to allow userspace to allow
> > writing to random pages all over.
> 
> You not read/write to random pages, there still is a reference to the page.
> So that the page can't be reused until you are done.
> 
> The problem is rather that you can't guarantee that you write to the page
> which is mapped into the process at that location. E.g. the CPU and the GPU
> might see two different things.
> 
> > > > Leaking the IOMMU mappings otoh means rogue userspace could do a bunch of
> > > > stray writes (I don't see anywhere code in amdgpu_mn.c to unmap at least
> > > > the gpu side PTEs to make stuff inaccessible) and wreak the core kernel's
> > > > book-keeping.
> > > > 
> > > > In i915 we guarantee that we call set_page_dirty/mark_page_accessed only
> > > > after all the mappings are really gone (both GPU PTEs and sg mapping),
> > > > guaranteeing that any stray writes from either the GPU or IOMMU will
> > > > result in faults (except bugs in the IOMMU, but can't have it all, "IOMMU
> > > > actually works" is an assumption behind device isolation).
> > > Well exactly that's the point, the handling in i915 looks incorrect to me.
> > > You need to call set_page_dirty/mark_page_accessed way before the mapping is
> > > destroyed.
> > > 
> > > To be more precise for userptrs it must be called from the
> > > invalidate_range_start, but i915 seems to delegate everything into a
> > > background worker to avoid the locking problems.
> > Yeah, and at the end of the function there's a flush_work to make sure the
> > worker has caught up.
> Ah, yes haven't seen that.
> 
> But then grabbing the obj->base.dev->struct_mutex lock in cancel_userptr()
> is rather evil. You just silenced lockdep because you offloaded that into a
> work item.
> 
> So no matter how you put it i915 is clearly doing something wrong here :)

tbh I'm not entirely clear on the reasons why this works, but
cross-release lockdep catches these things, and it did not complain.

On a high-level we make sure that mm locks needed by get_user_pages do
_not_ nest within dev->struct_mutex. We have massive back-off slowpaths to
do anything that could fault outside of our own main gem locking.

That was (at least in the past) a major difference with amdgpu, which
essentially has none of these paths. That would trivially deadlock with
your own gem mmap fault handler, so you had (maybe that changed) a dumb
retry loop, which did shut up lockdep but didn't fix any of the locking
inversions.

So yeah, grabbing dev->struct_mutex is in principle totally fine while
holding all kinds of struct mm/vma locks. I'm not entirely clear why we
punt the actual unmapping to the worker though, maybe simply to not have a
constrained stack.

> > I know. i915 gem has tons of fallbacks and retry loops (we restart the
> > entire CS if needed), and i915 userptr pushes the entire get_user_pages
> > dance off into a worker if the fastpath doesn't succeed and we run out of
> > memory or hit contended locks. We also have obscene amounts of
> > __GFP_NORETRY and __GFP_NOWARN all over the place to make sure the core mm
> > code doesn't do something we don't want it do to do in the fastpaths
> > (because there's really no point in spending lots of time trying to make
> > memory available if we have a slowpath fallback with much less
> > constraints).
> Well I haven't audited the code, but I'm pretty sure that just mitigates the
> problem and silenced lockdep instead of really fixing the issue.

This is re: your statement that you can't unamp sg tables from the
shrinker. We can, because we've actually untangled the locking depencies
so that you can fully operate on gem objects from within mm/vma locks.
Maybe code has changed, but last time I looked at radeon/ttm a while back
that was totally not the case, and if you don't do all this work then yes
you'll deadlock.

Doen't mean it's not impossible, because we've done it :-)

> > We're also not limiting ourselves to GFP_NOIO, but instead have a
> > recursion detection&handling in our own shrinker callback to avoid these
> > deadlocks.
> 
> Which if you ask me is absolutely horrible. I mean the comment in
> linux/mutex.h sums it up pretty well:
> >  * This function should not be used, _ever_. It is purely for hysterical
> > GEM
> >  * raisins, and once those are gone this will be removed.

Well, it actually gets the job done. We'd need to at least get to
per-object locking, and probably even then we'd need to rewrite the code a
lot. But please note that this here is only to avoid the GFP_NOIO
constraint, all the other bits I clarified around why we don't actually
have circular locking (because the entire hierarchy is inverted for us)
still hold even if you would only trylock here.

Aside: Given that yesterday a bunch of folks complained on #dri-devel that
amdgpu prematurely OOMs compared to i915, and that we've switched from a
simple trylock to this nastiness to be able to recover from more low
memory situation it's maybe not such a silly idea. Horrible, but not silly
because actually necessary.
-Daniel
Daniel Vetter March 22, 2018, 7:18 a.m. UTC | #15
On Wed, Mar 21, 2018 at 12:54:20PM +0100, Christian König wrote:
> Am 21.03.2018 um 09:28 schrieb Daniel Vetter:
> > On Tue, Mar 20, 2018 at 06:47:57PM +0100, Christian König wrote:
> > > Am 20.03.2018 um 15:08 schrieb Daniel Vetter:
> > > > [SNIP]
> > > > For the in-driver reservation path (CS) having a slow-path that grabs a
> > > > temporary reference, drops the vram lock and then locks the reservation
> > > > normally (using the acquire context used already for the entire CS) is a
> > > > bit tricky, but totally feasible. Ttm doesn't do that though.
> > > That is exactly what we do in amdgpu as well, it's just not very efficient
> > > nor reliable to retry getting the right pages for a submission over and over
> > > again.
> > Out of curiosity, where's that code? I did read the ttm eviction code way
> > back, and that one definitely didn't do that. Would be interesting to
> > update my understanding.
> 
> That is in amdgpu_cs.c. amdgpu_cs_parser_bos() does a horrible dance with
> grabbing, releasing and regrabbing locks in a loop.
> 
> Then in amdgpu_cs_submit() we grab an lock preventing page table updates and
> check if all pages are still the one we want to have:
> >         amdgpu_mn_lock(p->mn);
> >         if (p->bo_list) {
> >                 for (i = p->bo_list->first_userptr;
> >                      i < p->bo_list->num_entries; ++i) {
> >                         struct amdgpu_bo *bo = p->bo_list->array[i].robj;
> > 
> >                         if
> > (amdgpu_ttm_tt_userptr_needs_pages(bo->tbo.ttm)) {
> >                                 amdgpu_mn_unlock(p->mn);
> >                                 return -ERESTARTSYS;
> >                         }
> >                 }
> >         }
> 
> If anything changed on the page tables we restart the whole IOCTL using
> -ERESTARTSYS and try again.

I'm not talking about userptr here, but general bo eviction. Sorry for the
confusion.

The reason I'm dragging all the general bo management into this
discussions is because we do seem to have fairly fundamental difference in
how that's done, with resulting consequences for the locking hierarchy.

And if this invalidate_mapping stuff should work, together with userptr
and everything else, I think we're required to agree on how this is all
supposed to nest, and how exactly we should back off for the other side
that needs to break the locking circle.

That aside, I don't entirely understand why you need to restart so much. I
figured that get_user_pages is ordered correctly against mmu
invalidations, but I get the impression you think that's not the case. How
does that happen?
-Daniel
Christian König March 22, 2018, 9:37 a.m. UTC | #16
Am 22.03.2018 um 08:14 schrieb Daniel Vetter:
> On Wed, Mar 21, 2018 at 10:34:05AM +0100, Christian König wrote:
>> Am 21.03.2018 um 09:18 schrieb Daniel Vetter:
>>> [SNIP]
>> For correct operation you always need to implement invalidate_range_end as
>> well and add some lock/completion work Otherwise get_user_pages() can again
>> grab the reference to the wrong page.
> Is this really a problem?

Yes, and quite a big one.

> I figured that if a mmu_notifier invalidation is
> going on, a get_user_pages on that mm from anywhere else (whether i915 or
> anyone really) will serialize with the ongoing invalidate?

No, that isn't correct. Jerome can probably better explain that than I do.

> If that's not the case, then really any get_user_pages is racy, including all the
> DIRECT_IO ones.

The key point here is that get_user_pages() grabs a reference to the 
page. So what you get is a bunch of pages which where mapped at that 
location at a specific point in time.

There is no guarantee that after get_user_pages() return you still have 
the same pages mapped at that point, you only guarantee that the pages 
are not reused for something else.

That is perfectly sufficient for a task like DIRECT_IO where you can 
only have block or network I/O, but unfortunately not really for GPUs 
where you crunch of results, write them back to pages and actually count 
on that the CPU sees the result in the right place.

>> [SNIP]
>> So no matter how you put it i915 is clearly doing something wrong here :)
> tbh I'm not entirely clear on the reasons why this works, but
> cross-release lockdep catches these things, and it did not complain.
> On a high-level we make sure that mm locks needed by get_user_pages do
> _not_ nest within dev->struct_mutex. We have massive back-off slowpaths to
> do anything that could fault outside of our own main gem locking.

I'm pretty sure that this doesn't work as intended and just hides the 
real problem.

> That was (at least in the past) a major difference with amdgpu, which
> essentially has none of these paths. That would trivially deadlock with
> your own gem mmap fault handler, so you had (maybe that changed) a dumb
> retry loop, which did shut up lockdep but didn't fix any of the locking
> inversions.

Any lock you grab in an MMU callback can't be even held when you call 
kmalloc() or get_free_page() (without GFP_NOIO).

Even simple things like drm_vm_open() violate that by using GFP_KERNEL. 
So I can 100% ensure you that what you do here is not correct.

> So yeah, grabbing dev->struct_mutex is in principle totally fine while
> holding all kinds of struct mm/vma locks. I'm not entirely clear why we
> punt the actual unmapping to the worker though, maybe simply to not have a
> constrained stack.

I strongly disagree on that. As far as I can see what TTM does looks 
actually like the right approach to the problem.

> This is re: your statement that you can't unamp sg tables from the
> shrinker. We can, because we've actually untangled the locking depencies
> so that you can fully operate on gem objects from within mm/vma locks.
> Maybe code has changed, but last time I looked at radeon/ttm a while back
> that was totally not the case, and if you don't do all this work then yes
> you'll deadlock.
>
> Doen't mean it's not impossible, because we've done it :-)

And I'm pretty sure you didn't do it correctly :D

> Well, it actually gets the job done. We'd need to at least get to
> per-object locking, and probably even then we'd need to rewrite the code a
> lot. But please note that this here is only to avoid the GFP_NOIO
> constraint, all the other bits I clarified around why we don't actually
> have circular locking (because the entire hierarchy is inverted for us)
> still hold even if you would only trylock here.

Well you reversed your allocation and mmap_sem lock which avoids the 
lock inversion during page faults, but it doesn't help you at all with 
the MMU notifier and shrinker because then there are a lot more locks 
involved.

Regards,
Christian.

> Aside: Given that yesterday a bunch of folks complained on #dri-devel that
> amdgpu prematurely OOMs compared to i915, and that we've switched from a
> simple trylock to this nastiness to be able to recover from more low
> memory situation it's maybe not such a silly idea. Horrible, but not silly
> because actually necessary.
> -Daniel
Christian König March 22, 2018, 9:58 a.m. UTC | #17
Am 22.03.2018 um 08:18 schrieb Daniel Vetter:
> On Wed, Mar 21, 2018 at 12:54:20PM +0100, Christian König wrote:
>> Am 21.03.2018 um 09:28 schrieb Daniel Vetter:
>>> On Tue, Mar 20, 2018 at 06:47:57PM +0100, Christian König wrote:
>>>> Am 20.03.2018 um 15:08 schrieb Daniel Vetter:
>>>>> [SNIP]
>>>>> For the in-driver reservation path (CS) having a slow-path that grabs a
>>>>> temporary reference, drops the vram lock and then locks the reservation
>>>>> normally (using the acquire context used already for the entire CS) is a
>>>>> bit tricky, but totally feasible. Ttm doesn't do that though.
>>>> That is exactly what we do in amdgpu as well, it's just not very efficient
>>>> nor reliable to retry getting the right pages for a submission over and over
>>>> again.
>>> Out of curiosity, where's that code? I did read the ttm eviction code way
>>> back, and that one definitely didn't do that. Would be interesting to
>>> update my understanding.
>> That is in amdgpu_cs.c. amdgpu_cs_parser_bos() does a horrible dance with
>> grabbing, releasing and regrabbing locks in a loop.
>>
>> Then in amdgpu_cs_submit() we grab an lock preventing page table updates and
>> check if all pages are still the one we want to have:
>>>          amdgpu_mn_lock(p->mn);
>>>          if (p->bo_list) {
>>>                  for (i = p->bo_list->first_userptr;
>>>                       i < p->bo_list->num_entries; ++i) {
>>>                          struct amdgpu_bo *bo = p->bo_list->array[i].robj;
>>>
>>>                          if
>>> (amdgpu_ttm_tt_userptr_needs_pages(bo->tbo.ttm)) {
>>>                                  amdgpu_mn_unlock(p->mn);
>>>                                  return -ERESTARTSYS;
>>>                          }
>>>                  }
>>>          }
>> If anything changed on the page tables we restart the whole IOCTL using
>> -ERESTARTSYS and try again.
> I'm not talking about userptr here, but general bo eviction. Sorry for the
> confusion.
>
> The reason I'm dragging all the general bo management into this
> discussions is because we do seem to have fairly fundamental difference in
> how that's done, with resulting consequences for the locking hierarchy.
>
> And if this invalidate_mapping stuff should work, together with userptr
> and everything else, I think we're required to agree on how this is all
> supposed to nest, and how exactly we should back off for the other side
> that needs to break the locking circle.
>
> That aside, I don't entirely understand why you need to restart so much. I
> figured that get_user_pages is ordered correctly against mmu
> invalidations, but I get the impression you think that's not the case. How
> does that happen?

Correct. I've had the same assumption, but both Jerome as well as our 
internal tests proved me wrong on that.

Key take away from that was that you can't take any locks from neither 
the MMU notifier nor the shrinker you also take while calling kmalloc 
(or simpler speaking get_user_pages()).

Additional to that in the MMU or shrinker callback all different kinds 
of locks might be held, so you basically can't assume that you do thinks 
like recursive page table walks or call dma_unmap_anything.

Thinking a moment about that it actually seems to make perfect sense.

So it doesn't matter what order you got between the mmap_sem and your 
buffer or allocation lock, it will simply be incorrect with other locks 
in the system anyway.

The only solution to that problem we have found is the dance we have in 
amdgpu now:

1. Inside invalidate_range_start you grab a recursive read side lock.
2. Wait for all GPU operation on that pages to finish.
3. Then mark all pages as dirty and accessed.
4. Inside invalidate_range_end you release your recursive read side lock 
again.

Then during command submission you do the following:

1. Take the locks Figure out all the userptr BOs which needs new pages.
2. Drop the locks again call get_user_pages().
3. Install the new page arrays and reiterate with #1
4. As soon as everybody has pages update your page tables and prepare 
all command submission.
5. Take the write side of our recursive lock.
6. Check if all pages are still valid, if not restart the whole IOCTL.
7. Submit the job to the hardware.
8. Drop the write side of our recursive lock.

Regards,
Christian.

> -Daniel
Daniel Vetter March 26, 2018, 7:51 a.m. UTC | #18
On Thu, Mar 22, 2018 at 10:37:55AM +0100, Christian König wrote:
> Am 22.03.2018 um 08:14 schrieb Daniel Vetter:
> > On Wed, Mar 21, 2018 at 10:34:05AM +0100, Christian König wrote:
> > > Am 21.03.2018 um 09:18 schrieb Daniel Vetter:
> > > > [SNIP]
> > > For correct operation you always need to implement invalidate_range_end as
> > > well and add some lock/completion work Otherwise get_user_pages() can again
> > > grab the reference to the wrong page.
> > Is this really a problem?
> 
> Yes, and quite a big one.
> 
> > I figured that if a mmu_notifier invalidation is
> > going on, a get_user_pages on that mm from anywhere else (whether i915 or
> > anyone really) will serialize with the ongoing invalidate?
> 
> No, that isn't correct. Jerome can probably better explain that than I do.
> 
> > If that's not the case, then really any get_user_pages is racy, including all the
> > DIRECT_IO ones.
> 
> The key point here is that get_user_pages() grabs a reference to the page.
> So what you get is a bunch of pages which where mapped at that location at a
> specific point in time.
> 
> There is no guarantee that after get_user_pages() return you still have the
> same pages mapped at that point, you only guarantee that the pages are not
> reused for something else.
> 
> That is perfectly sufficient for a task like DIRECT_IO where you can only
> have block or network I/O, but unfortunately not really for GPUs where you
> crunch of results, write them back to pages and actually count on that the
> CPU sees the result in the right place.

Hm ok, I'll chat with Jerome about this. I thought we have epic amounts of
userptr tests, including thrashing the mappings vs gpu activity, so I'm
somewhat surprised that this hasn't blown up yet.

> > > [SNIP]
> > > So no matter how you put it i915 is clearly doing something wrong here :)
> > tbh I'm not entirely clear on the reasons why this works, but
> > cross-release lockdep catches these things, and it did not complain.
> > On a high-level we make sure that mm locks needed by get_user_pages do
> > _not_ nest within dev->struct_mutex. We have massive back-off slowpaths to
> > do anything that could fault outside of our own main gem locking.
> 
> I'm pretty sure that this doesn't work as intended and just hides the real
> problem.
> 
> > That was (at least in the past) a major difference with amdgpu, which
> > essentially has none of these paths. That would trivially deadlock with
> > your own gem mmap fault handler, so you had (maybe that changed) a dumb
> > retry loop, which did shut up lockdep but didn't fix any of the locking
> > inversions.
> 
> Any lock you grab in an MMU callback can't be even held when you call
> kmalloc() or get_free_page() (without GFP_NOIO).
> 
> Even simple things like drm_vm_open() violate that by using GFP_KERNEL. So I
> can 100% ensure you that what you do here is not correct.

drm_vm_open isn't used by modern drivers anymore. We have validated the
locking with the cross-release stuff for a few weeks, and it didn't catch
stuff. So I'm not worried that the locking is busted, only the mmu
notifier vs. get_user_pages races concerns me.

> > So yeah, grabbing dev->struct_mutex is in principle totally fine while
> > holding all kinds of struct mm/vma locks. I'm not entirely clear why we
> > punt the actual unmapping to the worker though, maybe simply to not have a
> > constrained stack.
> 
> I strongly disagree on that. As far as I can see what TTM does looks
> actually like the right approach to the problem.
> 
> > This is re: your statement that you can't unamp sg tables from the
> > shrinker. We can, because we've actually untangled the locking depencies
> > so that you can fully operate on gem objects from within mm/vma locks.
> > Maybe code has changed, but last time I looked at radeon/ttm a while back
> > that was totally not the case, and if you don't do all this work then yes
> > you'll deadlock.
> > 
> > Doen't mean it's not impossible, because we've done it :-)
> 
> And I'm pretty sure you didn't do it correctly :D
> 
> > Well, it actually gets the job done. We'd need to at least get to
> > per-object locking, and probably even then we'd need to rewrite the code a
> > lot. But please note that this here is only to avoid the GFP_NOIO
> > constraint, all the other bits I clarified around why we don't actually
> > have circular locking (because the entire hierarchy is inverted for us)
> > still hold even if you would only trylock here.
> 
> Well you reversed your allocation and mmap_sem lock which avoids the lock
> inversion during page faults, but it doesn't help you at all with the MMU
> notifier and shrinker because then there are a lot more locks involved.

I think we're handling all those locks correctly too. At least the
drm_vm_open one is not the locking inversion you're looking for (since we
fixed that, iirc even as part of userptr). Which one is it then?

And I disagree that leaking IOMMU mappings around just because we can't
untangle the locking rules is good engineering. Those IOMMU mappings need
to go, and KVM can pull it off. So should we.
-Daniel

> 
> Regards,
> Christian.
> 
> > Aside: Given that yesterday a bunch of folks complained on #dri-devel that
> > amdgpu prematurely OOMs compared to i915, and that we've switched from a
> > simple trylock to this nastiness to be able to recover from more low
> > memory situation it's maybe not such a silly idea. Horrible, but not silly
> > because actually necessary.
> > -Daniel
>
Daniel Vetter March 26, 2018, 8:01 a.m. UTC | #19
On Thu, Mar 22, 2018 at 10:58:55AM +0100, Christian König wrote:
> Am 22.03.2018 um 08:18 schrieb Daniel Vetter:
> > On Wed, Mar 21, 2018 at 12:54:20PM +0100, Christian König wrote:
> > > Am 21.03.2018 um 09:28 schrieb Daniel Vetter:
> > > > On Tue, Mar 20, 2018 at 06:47:57PM +0100, Christian König wrote:
> > > > > Am 20.03.2018 um 15:08 schrieb Daniel Vetter:
> > > > > > [SNIP]
> > > > > > For the in-driver reservation path (CS) having a slow-path that grabs a
> > > > > > temporary reference, drops the vram lock and then locks the reservation
> > > > > > normally (using the acquire context used already for the entire CS) is a
> > > > > > bit tricky, but totally feasible. Ttm doesn't do that though.
> > > > > That is exactly what we do in amdgpu as well, it's just not very efficient
> > > > > nor reliable to retry getting the right pages for a submission over and over
> > > > > again.
> > > > Out of curiosity, where's that code? I did read the ttm eviction code way
> > > > back, and that one definitely didn't do that. Would be interesting to
> > > > update my understanding.
> > > That is in amdgpu_cs.c. amdgpu_cs_parser_bos() does a horrible dance with
> > > grabbing, releasing and regrabbing locks in a loop.
> > > 
> > > Then in amdgpu_cs_submit() we grab an lock preventing page table updates and
> > > check if all pages are still the one we want to have:
> > > >          amdgpu_mn_lock(p->mn);
> > > >          if (p->bo_list) {
> > > >                  for (i = p->bo_list->first_userptr;
> > > >                       i < p->bo_list->num_entries; ++i) {
> > > >                          struct amdgpu_bo *bo = p->bo_list->array[i].robj;
> > > > 
> > > >                          if
> > > > (amdgpu_ttm_tt_userptr_needs_pages(bo->tbo.ttm)) {
> > > >                                  amdgpu_mn_unlock(p->mn);
> > > >                                  return -ERESTARTSYS;
> > > >                          }
> > > >                  }
> > > >          }
> > > If anything changed on the page tables we restart the whole IOCTL using
> > > -ERESTARTSYS and try again.
> > I'm not talking about userptr here, but general bo eviction. Sorry for the
> > confusion.
> > 
> > The reason I'm dragging all the general bo management into this
> > discussions is because we do seem to have fairly fundamental difference in
> > how that's done, with resulting consequences for the locking hierarchy.
> > 
> > And if this invalidate_mapping stuff should work, together with userptr
> > and everything else, I think we're required to agree on how this is all
> > supposed to nest, and how exactly we should back off for the other side
> > that needs to break the locking circle.
> > 
> > That aside, I don't entirely understand why you need to restart so much. I
> > figured that get_user_pages is ordered correctly against mmu
> > invalidations, but I get the impression you think that's not the case. How
> > does that happen?
> 
> Correct. I've had the same assumption, but both Jerome as well as our
> internal tests proved me wrong on that.
> 
> Key take away from that was that you can't take any locks from neither the
> MMU notifier nor the shrinker you also take while calling kmalloc (or
> simpler speaking get_user_pages()).
> 
> Additional to that in the MMU or shrinker callback all different kinds of
> locks might be held, so you basically can't assume that you do thinks like
> recursive page table walks or call dma_unmap_anything.

That sounds like a design bug in mmu_notifiers, since it would render them
useless for KVM. And they were developed for that originally. I think I'll
chat with Jerome to understand this, since it's all rather confusing.

> Thinking a moment about that it actually seems to make perfect sense.
> So it doesn't matter what order you got between the mmap_sem and your buffer
> or allocation lock, it will simply be incorrect with other locks in the
> system anyway.

Hm, doesn't make sense to me, at least from a locking  inversion pov. I
thought the only locks that are definitely part of the mmu_nofitier
callback contexts are the mm locks. We're definitely fine with those, and
kmalloc didn't bite us yet. Direct reclaim is really limited in what it
can accomplish for good reasons.

So the only thing I'm worried here is that we have a race between the
invalidate and the gup calls, and I think fixing that race will make the
locking more fun. But again, if that means you can'd dma_unmap stuff then
that feels like mmu notifiers are broken by design.
-Daniel

> The only solution to that problem we have found is the dance we have in
> amdgpu now:
> 
> 1. Inside invalidate_range_start you grab a recursive read side lock.
> 2. Wait for all GPU operation on that pages to finish.
> 3. Then mark all pages as dirty and accessed.
> 4. Inside invalidate_range_end you release your recursive read side lock
> again.
> 
> Then during command submission you do the following:
> 
> 1. Take the locks Figure out all the userptr BOs which needs new pages.
> 2. Drop the locks again call get_user_pages().
> 3. Install the new page arrays and reiterate with #1
> 4. As soon as everybody has pages update your page tables and prepare all
> command submission.
> 5. Take the write side of our recursive lock.
> 6. Check if all pages are still valid, if not restart the whole IOCTL.
> 7. Submit the job to the hardware.
> 8. Drop the write side of our recursive lock.
> 
> Regards,
> Christian.
> 
> > -Daniel
>
Jerome Glisse March 26, 2018, 3:42 p.m. UTC | #20
On Mon, Mar 26, 2018 at 10:01:21AM +0200, Daniel Vetter wrote:
> On Thu, Mar 22, 2018 at 10:58:55AM +0100, Christian König wrote:
> > Am 22.03.2018 um 08:18 schrieb Daniel Vetter:
> > > On Wed, Mar 21, 2018 at 12:54:20PM +0100, Christian König wrote:
> > > > Am 21.03.2018 um 09:28 schrieb Daniel Vetter:
> > > > > On Tue, Mar 20, 2018 at 06:47:57PM +0100, Christian König wrote:
> > > > > > Am 20.03.2018 um 15:08 schrieb Daniel Vetter:
> > > > > > > [SNIP]
> > > > > > > For the in-driver reservation path (CS) having a slow-path that grabs a
> > > > > > > temporary reference, drops the vram lock and then locks the reservation
> > > > > > > normally (using the acquire context used already for the entire CS) is a
> > > > > > > bit tricky, but totally feasible. Ttm doesn't do that though.
> > > > > > That is exactly what we do in amdgpu as well, it's just not very efficient
> > > > > > nor reliable to retry getting the right pages for a submission over and over
> > > > > > again.
> > > > > Out of curiosity, where's that code? I did read the ttm eviction code way
> > > > > back, and that one definitely didn't do that. Would be interesting to
> > > > > update my understanding.
> > > > That is in amdgpu_cs.c. amdgpu_cs_parser_bos() does a horrible dance with
> > > > grabbing, releasing and regrabbing locks in a loop.
> > > > 
> > > > Then in amdgpu_cs_submit() we grab an lock preventing page table updates and
> > > > check if all pages are still the one we want to have:
> > > > >          amdgpu_mn_lock(p->mn);
> > > > >          if (p->bo_list) {
> > > > >                  for (i = p->bo_list->first_userptr;
> > > > >                       i < p->bo_list->num_entries; ++i) {
> > > > >                          struct amdgpu_bo *bo = p->bo_list->array[i].robj;
> > > > > 
> > > > >                          if
> > > > > (amdgpu_ttm_tt_userptr_needs_pages(bo->tbo.ttm)) {
> > > > >                                  amdgpu_mn_unlock(p->mn);
> > > > >                                  return -ERESTARTSYS;
> > > > >                          }
> > > > >                  }
> > > > >          }
> > > > If anything changed on the page tables we restart the whole IOCTL using
> > > > -ERESTARTSYS and try again.
> > > I'm not talking about userptr here, but general bo eviction. Sorry for the
> > > confusion.
> > > 
> > > The reason I'm dragging all the general bo management into this
> > > discussions is because we do seem to have fairly fundamental difference in
> > > how that's done, with resulting consequences for the locking hierarchy.
> > > 
> > > And if this invalidate_mapping stuff should work, together with userptr
> > > and everything else, I think we're required to agree on how this is all
> > > supposed to nest, and how exactly we should back off for the other side
> > > that needs to break the locking circle.
> > > 
> > > That aside, I don't entirely understand why you need to restart so much. I
> > > figured that get_user_pages is ordered correctly against mmu
> > > invalidations, but I get the impression you think that's not the case. How
> > > does that happen?
> > 
> > Correct. I've had the same assumption, but both Jerome as well as our
> > internal tests proved me wrong on that.
> > 
> > Key take away from that was that you can't take any locks from neither the
> > MMU notifier nor the shrinker you also take while calling kmalloc (or
> > simpler speaking get_user_pages()).
> > 
> > Additional to that in the MMU or shrinker callback all different kinds of
> > locks might be held, so you basically can't assume that you do thinks like
> > recursive page table walks or call dma_unmap_anything.
> 
> That sounds like a design bug in mmu_notifiers, since it would render them
> useless for KVM. And they were developed for that originally. I think I'll
> chat with Jerome to understand this, since it's all rather confusing.

Doing dma_unmap() during mmu_notifier callback should be fine, it was last
time i check. However there is no formal contract that it is ok to do so.
But i want to revisit DMA API to something where device driver get control
rather than delegate. High level ideas are:
  - device driver allocate DMA space ie range of physical bus address
  - device driver completely manage the range in whichever ways it wants
    under its own locking.
  - DMA provide API to flush tlb/cache of a range of physical bus address
    with the explicit contract that it can only take short lived spinlock
    to do so and not allocate or free memory while doing so (ie it should
    only be a bunch of registers writes)

I haven't had time to prototype this but i floated the idea a while back
and i am hoping to discuss it some more sooner rather than latter (LSF/MM
is coming soon).


So from a safety point of view i would argue that you must be doing dma_unmap()
from invalidate_range_start() callback. But it all depends on how confident
you are in your hardware mmu and the IOMMU (IOMMU usualy have small cache so
unless there is no PCIE traffic all pending PCIE transaction should post
quicker than it takes the CPU to finish your invalidate_range_start() function.

So for now i would say it is ok to not dma_unmap().


> > Thinking a moment about that it actually seems to make perfect sense.
> > So it doesn't matter what order you got between the mmap_sem and your buffer
> > or allocation lock, it will simply be incorrect with other locks in the
> > system anyway.
> 
> Hm, doesn't make sense to me, at least from a locking  inversion pov. I
> thought the only locks that are definitely part of the mmu_nofitier
> callback contexts are the mm locks. We're definitely fine with those, and
> kmalloc didn't bite us yet. Direct reclaim is really limited in what it
> can accomplish for good reasons.
> 
> So the only thing I'm worried here is that we have a race between the
> invalidate and the gup calls, and I think fixing that race will make the
> locking more fun. But again, if that means you can'd dma_unmap stuff then
> that feels like mmu notifiers are broken by design.
> -Daniel
> 
> > The only solution to that problem we have found is the dance we have in
> > amdgpu now:
> > 
> > 1. Inside invalidate_range_start you grab a recursive read side lock.
> > 2. Wait for all GPU operation on that pages to finish.
> > 3. Then mark all pages as dirty and accessed.
> > 4. Inside invalidate_range_end you release your recursive read side lock
> > again.
> > 
> > Then during command submission you do the following:
> > 
> > 1. Take the locks Figure out all the userptr BOs which needs new pages.
> > 2. Drop the locks again call get_user_pages().
> > 3. Install the new page arrays and reiterate with #1
> > 4. As soon as everybody has pages update your page tables and prepare all
> > command submission.
> > 5. Take the write side of our recursive lock.
> > 6. Check if all pages are still valid, if not restart the whole IOCTL.
> > 7. Submit the job to the hardware.
> > 8. Drop the write side of our recursive lock.

A slightly better solution is using atomic counter:
  driver_range_start() {
    atomic_inc(&mydev->notifier_count);
    lock(mydev->bo_uptr); // protecting bo list/tree
    for_each_uptr_bo() {
       invalidate_bo(bo, start, end);
    }
    unlock(mydev->bo_uptr);
  }

  driver_range_end() {
    atomic_inc(&mydev->notifier_seq);
    atomic_dec(&mydev->notifier_count);
  }

  driver_bo_gup_pages(bo, mydev) {
    unsigned curseq;
    do {
      wait_on_mmu_notifier_count_to_zero();
      curseq = atomic_read(&mydev->notifier_seq);
      for (i = 0; i < npages; i++) {
        pages[i] = gup(addr + (i << PAGE_SHIFT));
      }
      lock(mydev->bo_uptr); // protecting bo list/tree
      if (curseq == atomic_read(&mydev->notifier_seq) &&
          !atomic_read(mydev->notifier_count)) {
         add_uptr_bo(bo, pages);
         bo->seq = curseq;
         unlock(mydev->bo_uptr);
         return 0;
      }
      unlock(mydev->bo_uptr);
      put_all_pages(pages);
    } while (1);
  }

Maybe use KVM as a real example of how to do this (see virt/kvm/kvm_main.c)

Note that however my endgame is bit more simpler. Convert amd, intel to use
HMM CPU page table snapshot tool (and no you do not need page fault for that
you can use that feature independently).

So it would be:
  driver_bo_hmm_invalidate() {
    lock(mydev->bo_uptr); // protecting bo list/tree
    for_each_uptr_bo() {
       invalidate_bo(bo, start, end);
    }
    unlock(mydev->bo_uptr);
  }

  driver_bo_gup_pages(bo, mydev) {
    do {
      hmm_vma_fault(range, bo->start, bo->end, bo->pfns);
      lock(mydev->bo_uptr);
      if (hmm_vma_range_done(range)) {
         add_uptr_bo(bo, pfns);
      }
      unlock(mydev->bo_uptr);
    } while (1);
  }

I would like to see driver using same code, as it means one place to fix
issues. I had for a long time on my TODO list doing the above conversion
to amd or radeon kernel driver. I am pushing up my todo list hopefully in
next few weeks i can send an rfc so people can have a real sense of how
it can look.


Cheers,
Jérôme
Christian König March 27, 2018, 7:35 a.m. UTC | #21
Am 26.03.2018 um 17:42 schrieb Jerome Glisse:
> On Mon, Mar 26, 2018 at 10:01:21AM +0200, Daniel Vetter wrote:
>> On Thu, Mar 22, 2018 at 10:58:55AM +0100, Christian König wrote:
>>> Am 22.03.2018 um 08:18 schrieb Daniel Vetter:
>>> [SNIP]
>>> Key take away from that was that you can't take any locks from neither the
>>> MMU notifier nor the shrinker you also take while calling kmalloc (or
>>> simpler speaking get_user_pages()).
>>>
>>> Additional to that in the MMU or shrinker callback all different kinds of
>>> locks might be held, so you basically can't assume that you do thinks like
>>> recursive page table walks or call dma_unmap_anything.
>> That sounds like a design bug in mmu_notifiers, since it would render them
>> useless for KVM. And they were developed for that originally. I think I'll
>> chat with Jerome to understand this, since it's all rather confusing.
> Doing dma_unmap() during mmu_notifier callback should be fine, it was last
> time i check. However there is no formal contract that it is ok to do so.

As I said before dma_unmap() isn't the real problem here.

The issues is more that you can't take a lock in the MMU notifier which 
you would also take while allocating memory without GFP_NOIO.

That makes it rather tricky to do any command submission, e.g. you need 
to grab all the pages/memory/resources prehand, then make sure that you 
don't have a MMU notifier running concurrently and do the submission.

If any of the prerequisites isn't fulfilled we need to restart the 
operation.

> [SNIP]
> A slightly better solution is using atomic counter:
>    driver_range_start() {
>      atomic_inc(&mydev->notifier_count);
...

Yeah, that is exactly what amdgpu is doing now. Sorry if my description 
didn't made that clear.

> I would like to see driver using same code, as it means one place to fix
> issues. I had for a long time on my TODO list doing the above conversion
> to amd or radeon kernel driver. I am pushing up my todo list hopefully in
> next few weeks i can send an rfc so people can have a real sense of how
> it can look.

Certainly a good idea, but I think we might have that separate to HMM.

TTM suffered really from feature overload, e.g. trying to do everything 
in a single subsystem. And it would be rather nice to have coherent 
userptr handling for GPUs as separate feature.

Regards,
Christian.
Daniel Vetter March 27, 2018, 7:53 a.m. UTC | #22
On Tue, Mar 27, 2018 at 09:35:17AM +0200, Christian König wrote:
> Am 26.03.2018 um 17:42 schrieb Jerome Glisse:
> > On Mon, Mar 26, 2018 at 10:01:21AM +0200, Daniel Vetter wrote:
> > > On Thu, Mar 22, 2018 at 10:58:55AM +0100, Christian König wrote:
> > > > Am 22.03.2018 um 08:18 schrieb Daniel Vetter:
> > > > [SNIP]
> > > > Key take away from that was that you can't take any locks from neither the
> > > > MMU notifier nor the shrinker you also take while calling kmalloc (or
> > > > simpler speaking get_user_pages()).
> > > > 
> > > > Additional to that in the MMU or shrinker callback all different kinds of
> > > > locks might be held, so you basically can't assume that you do thinks like
> > > > recursive page table walks or call dma_unmap_anything.
> > > That sounds like a design bug in mmu_notifiers, since it would render them
> > > useless for KVM. And they were developed for that originally. I think I'll
> > > chat with Jerome to understand this, since it's all rather confusing.
> > Doing dma_unmap() during mmu_notifier callback should be fine, it was last
> > time i check. However there is no formal contract that it is ok to do so.
> 
> As I said before dma_unmap() isn't the real problem here.
> 
> The issues is more that you can't take a lock in the MMU notifier which you
> would also take while allocating memory without GFP_NOIO.
> 
> That makes it rather tricky to do any command submission, e.g. you need to
> grab all the pages/memory/resources prehand, then make sure that you don't
> have a MMU notifier running concurrently and do the submission.
> 
> If any of the prerequisites isn't fulfilled we need to restart the
> operation.

Yeah we're hitting all that epic amount of fun now, after a chat with
Jerome yesterady. I guess we'll figure out what we're coming up with.

> > [SNIP]
> > A slightly better solution is using atomic counter:
> >    driver_range_start() {
> >      atomic_inc(&mydev->notifier_count);
> ...
> 
> Yeah, that is exactly what amdgpu is doing now. Sorry if my description
> didn't made that clear.
> 
> > I would like to see driver using same code, as it means one place to fix
> > issues. I had for a long time on my TODO list doing the above conversion
> > to amd or radeon kernel driver. I am pushing up my todo list hopefully in
> > next few weeks i can send an rfc so people can have a real sense of how
> > it can look.
> 
> Certainly a good idea, but I think we might have that separate to HMM.
> 
> TTM suffered really from feature overload, e.g. trying to do everything in a
> single subsystem. And it would be rather nice to have coherent userptr
> handling for GPUs as separate feature.

TTM suffered from being a midlayer imo, not from doing too much. HMM is
apparently structured like a toolbox (despite its documentation claiming
otherwise), so you can pick&choose freely.
-Daniel
Christian König March 27, 2018, 8:06 a.m. UTC | #23
Am 27.03.2018 um 09:53 schrieb Daniel Vetter:
> [SNIP]
>>> [SNIP]
>>> A slightly better solution is using atomic counter:
>>>     driver_range_start() {
>>>       atomic_inc(&mydev->notifier_count);
>> ...
>>
>> Yeah, that is exactly what amdgpu is doing now. Sorry if my description
>> didn't made that clear.
>>
>>> I would like to see driver using same code, as it means one place to fix
>>> issues. I had for a long time on my TODO list doing the above conversion
>>> to amd or radeon kernel driver. I am pushing up my todo list hopefully in
>>> next few weeks i can send an rfc so people can have a real sense of how
>>> it can look.
>> Certainly a good idea, but I think we might have that separate to HMM.
>>
>> TTM suffered really from feature overload, e.g. trying to do everything in a
>> single subsystem. And it would be rather nice to have coherent userptr
>> handling for GPUs as separate feature.
> TTM suffered from being a midlayer imo, not from doing too much.

Yeah, completely agree.

midlayers work as long as you concentrate on doing exactly one things in 
your midlayer. E.g. in the case of TTM the callback for BO move handling 
is well justified.

Only all the stuff around it like address space handling etc... is 
really wrong designed and should be separated (which is exactly what DRM 
MM did, but TTM still uses this in the wrong way).

> HMM is apparently structured like a toolbox (despite its documentation claiming
> otherwise), so you can pick&choose freely.

That sounds good, but I would still have a better feeling if userptr 
handling would be separated. That avoids mangling things together again.

Christian.

> -Daniel
Daniel Vetter March 27, 2018, 8:27 a.m. UTC | #24
On Tue, Mar 27, 2018 at 10:06:04AM +0200, Christian König wrote:
> Am 27.03.2018 um 09:53 schrieb Daniel Vetter:
> > [SNIP]
> > > > [SNIP]
> > > > A slightly better solution is using atomic counter:
> > > >     driver_range_start() {
> > > >       atomic_inc(&mydev->notifier_count);
> > > ...
> > > 
> > > Yeah, that is exactly what amdgpu is doing now. Sorry if my description
> > > didn't made that clear.
> > > 
> > > > I would like to see driver using same code, as it means one place to fix
> > > > issues. I had for a long time on my TODO list doing the above conversion
> > > > to amd or radeon kernel driver. I am pushing up my todo list hopefully in
> > > > next few weeks i can send an rfc so people can have a real sense of how
> > > > it can look.
> > > Certainly a good idea, but I think we might have that separate to HMM.
> > > 
> > > TTM suffered really from feature overload, e.g. trying to do everything in a
> > > single subsystem. And it would be rather nice to have coherent userptr
> > > handling for GPUs as separate feature.
> > TTM suffered from being a midlayer imo, not from doing too much.
> 
> Yeah, completely agree.
> 
> midlayers work as long as you concentrate on doing exactly one things in
> your midlayer. E.g. in the case of TTM the callback for BO move handling is
> well justified.
> 
> Only all the stuff around it like address space handling etc... is really
> wrong designed and should be separated (which is exactly what DRM MM did,
> but TTM still uses this in the wrong way).

Yeah the addres space allocator part of ttm really is backwards and makes
adding quick driver hacks and heuristics for better allocations schemes
really hard to add. Same for tuning how/what exactly you evict.

> > HMM is apparently structured like a toolbox (despite its documentation claiming
> > otherwise), so you can pick&choose freely.
> 
> That sounds good, but I would still have a better feeling if userptr
> handling would be separated. That avoids mangling things together again.

Jerome said he wants to do at least one prototype conversion of one of the
"I can't fault" userptr implementation over to the suitable subset of HMM
helpers. I guess we'll see once he's submitting the patches, but it
sounded exactly like what the doctor ordered :-)
-Daniel
diff mbox

Patch

diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
index d78d5fc173dc..ed2b3559ba25 100644
--- a/drivers/dma-buf/dma-buf.c
+++ b/drivers/dma-buf/dma-buf.c
@@ -572,7 +572,9 @@  struct dma_buf_attachment *dma_buf_attach(struct dma_buf *dmabuf,
 		if (ret)
 			goto err_attach;
 	}
+	reservation_object_lock(dmabuf->resv, NULL);
 	list_add(&attach->node, &dmabuf->attachments);
+	reservation_object_unlock(dmabuf->resv);
 
 	mutex_unlock(&dmabuf->lock);
 	return attach;
@@ -598,7 +600,9 @@  void dma_buf_detach(struct dma_buf *dmabuf, struct dma_buf_attachment *attach)
 		return;
 
 	mutex_lock(&dmabuf->lock);
+	reservation_object_lock(dmabuf->resv, NULL);
 	list_del(&attach->node);
+	reservation_object_unlock(dmabuf->resv);
 	if (dmabuf->ops->detach)
 		dmabuf->ops->detach(dmabuf, attach);
 
@@ -632,10 +636,23 @@  struct sg_table *dma_buf_map_attachment(struct dma_buf_attachment *attach,
 	if (WARN_ON(!attach || !attach->dmabuf))
 		return ERR_PTR(-EINVAL);
 
+	/*
+	 * Mapping a DMA-buf can trigger its invalidation, prevent sending this
+	 * event to the caller by temporary removing this attachment from the
+	 * list.
+	 */
+	if (attach->invalidate_mappings) {
+		reservation_object_assert_held(attach->dmabuf->resv);
+		list_del(&attach->node);
+	}
+
 	sg_table = attach->dmabuf->ops->map_dma_buf(attach, direction);
 	if (!sg_table)
 		sg_table = ERR_PTR(-ENOMEM);
 
+	if (attach->invalidate_mappings)
+		list_add(&attach->node, &attach->dmabuf->attachments);
+
 	return sg_table;
 }
 EXPORT_SYMBOL_GPL(dma_buf_map_attachment);
@@ -656,6 +673,9 @@  void dma_buf_unmap_attachment(struct dma_buf_attachment *attach,
 {
 	might_sleep();
 
+	if (attach->invalidate_mappings)
+		reservation_object_assert_held(attach->dmabuf->resv);
+
 	if (WARN_ON(!attach || !attach->dmabuf || !sg_table))
 		return;
 
@@ -664,6 +684,44 @@  void dma_buf_unmap_attachment(struct dma_buf_attachment *attach,
 }
 EXPORT_SYMBOL_GPL(dma_buf_unmap_attachment);
 
+/**
+ * dma_buf_set_invalidate_callback - set the invalidate_mappings callback
+ *
+ * @attach:	[in]	attachment where to set the callback
+ * @cb:		[in]	the callback to set
+ *
+ * Makes sure to take the appropriate locks when updating the invalidate
+ * mappings callback.
+ */
+void dma_buf_set_invalidate_callback(struct dma_buf_attachment *attach,
+				     void (*cb)(struct dma_buf_attachment *))
+{
+	reservation_object_lock(attach->dmabuf->resv, NULL);
+	attach->invalidate_mappings = cb;
+	reservation_object_unlock(attach->dmabuf->resv);
+}
+EXPORT_SYMBOL_GPL(dma_buf_set_invalidate_callback);
+
+/**
+ * dma_buf_invalidate_mappings - invalidate all mappings of this dma_buf
+ *
+ * @dmabuf:	[in]	buffer which mappings should be invalidated
+ *
+ * Informs all attachmenst that they need to destroy and recreated all their
+ * mappings.
+ */
+void dma_buf_invalidate_mappings(struct dma_buf *dmabuf)
+{
+	struct dma_buf_attachment *attach;
+
+	reservation_object_assert_held(dmabuf->resv);
+
+	list_for_each_entry(attach, &dmabuf->attachments, node)
+		if (attach->invalidate_mappings)
+			attach->invalidate_mappings(attach);
+}
+EXPORT_SYMBOL_GPL(dma_buf_invalidate_mappings);
+
 /**
  * DOC: cpu access
  *
@@ -1121,10 +1179,12 @@  static int dma_buf_debug_show(struct seq_file *s, void *unused)
 		seq_puts(s, "\tAttached Devices:\n");
 		attach_count = 0;
 
+		reservation_object_lock(buf_obj->resv, NULL);
 		list_for_each_entry(attach_obj, &buf_obj->attachments, node) {
 			seq_printf(s, "\t%s\n", dev_name(attach_obj->dev));
 			attach_count++;
 		}
+		reservation_object_unlock(buf_obj->resv);
 
 		seq_printf(s, "Total %d devices attached\n\n",
 				attach_count);
diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h
index 085db2fee2d7..70c65fcfe1e3 100644
--- a/include/linux/dma-buf.h
+++ b/include/linux/dma-buf.h
@@ -91,6 +91,18 @@  struct dma_buf_ops {
 	 */
 	void (*detach)(struct dma_buf *, struct dma_buf_attachment *);
 
+	/**
+	 * @supports_mapping_invalidation:
+	 *
+	 * True for exporters which supports unpinned DMA-buf operation using
+	 * the reservation lock.
+	 *
+	 * When attachment->invalidate_mappings is set the @map_dma_buf and
+	 * @unmap_dma_buf callbacks can be called with the reservation lock
+	 * held.
+	 */
+	bool supports_mapping_invalidation;
+
 	/**
 	 * @map_dma_buf:
 	 *
@@ -326,6 +338,29 @@  struct dma_buf_attachment {
 	struct device *dev;
 	struct list_head node;
 	void *priv;
+
+	/**
+	 * @invalidate_mappings:
+	 *
+	 * Optional callback provided by the importer of the attachment which
+	 * must be set before mappings are created.
+	 *
+	 * If provided the exporter can avoid pinning the backing store while
+	 * mappings exists.
+	 *
+	 * The function is called with the lock of the reservation object
+	 * associated with the dma_buf held and the mapping function must be
+	 * called with this lock held as well. This makes sure that no mapping
+	 * is created concurrently with an ongoing invalidation.
+	 *
+	 * After the callback all existing mappings are still valid until all
+	 * fences in the dma_bufs reservation object are signaled, but should be
+	 * destroyed by the importer as soon as possible.
+	 *
+	 * New mappings can be created immediately, but can't be used before the
+	 * exclusive fence in the dma_bufs reservation object is signaled.
+	 */
+	void (*invalidate_mappings)(struct dma_buf_attachment *attach);
 };
 
 /**
@@ -391,6 +426,9 @@  struct sg_table *dma_buf_map_attachment(struct dma_buf_attachment *,
 					enum dma_data_direction);
 void dma_buf_unmap_attachment(struct dma_buf_attachment *, struct sg_table *,
 				enum dma_data_direction);
+void dma_buf_set_invalidate_callback(struct dma_buf_attachment *attach,
+				     void (*cb)(struct dma_buf_attachment *));
+void dma_buf_invalidate_mappings(struct dma_buf *dma_buf);
 int dma_buf_begin_cpu_access(struct dma_buf *dma_buf,
 			     enum dma_data_direction dir);
 int dma_buf_end_cpu_access(struct dma_buf *dma_buf,