diff mbox

[11/11] dma-buf: Do a fast lockless check for poll with timeout=0

Message ID 20160829070834.22296-11-chris@chris-wilson.co.uk (mailing list archive)
State New, archived
Headers show

Commit Message

Chris Wilson Aug. 29, 2016, 7:08 a.m. UTC
Currently we install a callback for performing poll on a dma-buf,
irrespective of the timeout. This involves taking a spinlock, as well as
unnecessary work, and greatly reduces scaling of poll(.timeout=0) across
multiple threads.

We can query whether the poll will block prior to installing the
callback to make the busy-query fast.

Single thread: 60% faster
8 threads on 4 (+4 HT) cores: 600% faster

Still not quite the perfect scaling we get with a native busy ioctl, but
poll(dmabuf) is faster due to the quicker lookup of the object and
avoiding drm_ioctl().

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Sumit Semwal <sumit.semwal@linaro.org>
Cc: linux-media@vger.kernel.org
Cc: dri-devel@lists.freedesktop.org
Cc: linaro-mm-sig@lists.linaro.org
Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>
---
 drivers/dma-buf/dma-buf.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

Comments

Daniel Vetter Sept. 23, 2016, 1:50 p.m. UTC | #1
On Mon, Aug 29, 2016 at 08:08:34AM +0100, Chris Wilson wrote:
> Currently we install a callback for performing poll on a dma-buf,
> irrespective of the timeout. This involves taking a spinlock, as well as
> unnecessary work, and greatly reduces scaling of poll(.timeout=0) across
> multiple threads.
> 
> We can query whether the poll will block prior to installing the
> callback to make the busy-query fast.
> 
> Single thread: 60% faster
> 8 threads on 4 (+4 HT) cores: 600% faster
> 
> Still not quite the perfect scaling we get with a native busy ioctl, but
> poll(dmabuf) is faster due to the quicker lookup of the object and
> avoiding drm_ioctl().
> 
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Sumit Semwal <sumit.semwal@linaro.org>
> Cc: linux-media@vger.kernel.org
> Cc: dri-devel@lists.freedesktop.org
> Cc: linaro-mm-sig@lists.linaro.org
> Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>

Need to strike the r-b here, since Christian König pointed out that
objects won't magically switch signalling on.
-Daniel

> ---
>  drivers/dma-buf/dma-buf.c | 12 ++++++++++++
>  1 file changed, 12 insertions(+)
> 
> diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
> index cf04d249a6a4..c7a7bc579941 100644
> --- a/drivers/dma-buf/dma-buf.c
> +++ b/drivers/dma-buf/dma-buf.c
> @@ -156,6 +156,18 @@ static unsigned int dma_buf_poll(struct file *file, poll_table *poll)
>  	if (!events)
>  		return 0;
>  
> +	if (poll_does_not_wait(poll)) {
> +		if (events & POLLOUT &&
> +		    !reservation_object_test_signaled_rcu(resv, true))
> +			events &= ~(POLLOUT | POLLIN);
> +
> +		if (events & POLLIN &&
> +		    !reservation_object_test_signaled_rcu(resv, false))
> +			events &= ~POLLIN;
> +
> +		return events;
> +	}
> +
>  retry:
>  	seq = read_seqcount_begin(&resv->seq);
>  	rcu_read_lock();
> -- 
> 2.9.3
> 
> _______________________________________________
> Intel-gfx mailing list
> Intel-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/intel-gfx
Chris Wilson Sept. 23, 2016, 2:15 p.m. UTC | #2
On Fri, Sep 23, 2016 at 03:50:44PM +0200, Daniel Vetter wrote:
> On Mon, Aug 29, 2016 at 08:08:34AM +0100, Chris Wilson wrote:
> > Currently we install a callback for performing poll on a dma-buf,
> > irrespective of the timeout. This involves taking a spinlock, as well as
> > unnecessary work, and greatly reduces scaling of poll(.timeout=0) across
> > multiple threads.
> > 
> > We can query whether the poll will block prior to installing the
> > callback to make the busy-query fast.
> > 
> > Single thread: 60% faster
> > 8 threads on 4 (+4 HT) cores: 600% faster
> > 
> > Still not quite the perfect scaling we get with a native busy ioctl, but
> > poll(dmabuf) is faster due to the quicker lookup of the object and
> > avoiding drm_ioctl().
> > 
> > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> > Cc: Sumit Semwal <sumit.semwal@linaro.org>
> > Cc: linux-media@vger.kernel.org
> > Cc: dri-devel@lists.freedesktop.org
> > Cc: linaro-mm-sig@lists.linaro.org
> > Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>
> 
> Need to strike the r-b here, since Christian König pointed out that
> objects won't magically switch signalling on.

The point being here that we don't even want to switch signaling on! :)

Christian's point was that not all fences guarantee forward progress
irrespective of whether signaling is enabled or not, and fences are not
required to guarantee forward progress without signaling even if they
provide an ops->signaled().
-Chris
Chris Wilson Sept. 23, 2016, 3:06 p.m. UTC | #3
On Fri, Sep 23, 2016 at 03:50:44PM +0200, Daniel Vetter wrote:
> On Mon, Aug 29, 2016 at 08:08:34AM +0100, Chris Wilson wrote:
> > Currently we install a callback for performing poll on a dma-buf,
> > irrespective of the timeout. This involves taking a spinlock, as well as
> > unnecessary work, and greatly reduces scaling of poll(.timeout=0) across
> > multiple threads.
> > 
> > We can query whether the poll will block prior to installing the
> > callback to make the busy-query fast.
> > 
> > Single thread: 60% faster
> > 8 threads on 4 (+4 HT) cores: 600% faster
> > 
> > Still not quite the perfect scaling we get with a native busy ioctl, but
> > poll(dmabuf) is faster due to the quicker lookup of the object and
> > avoiding drm_ioctl().
> > 
> > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> > Cc: Sumit Semwal <sumit.semwal@linaro.org>
> > Cc: linux-media@vger.kernel.org
> > Cc: dri-devel@lists.freedesktop.org
> > Cc: linaro-mm-sig@lists.linaro.org
> > Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>
> 
> Need to strike the r-b here, since Christian König pointed out that
> objects won't magically switch signalling on.

Propagating a flag through to sync_file is trivial, but not through to
the dma_buf->resv. Looks like dma-buf will be without a fast busy query,
which I guess in the grand scheme of things (i.e. dma-buf itself is not
intended to be used as a fence) is not that important.
-Chris
Chris Wilson Sept. 23, 2016, 3:20 p.m. UTC | #4
On Fri, Sep 23, 2016 at 03:50:44PM +0200, Daniel Vetter wrote:
> On Mon, Aug 29, 2016 at 08:08:34AM +0100, Chris Wilson wrote:
> > Currently we install a callback for performing poll on a dma-buf,
> > irrespective of the timeout. This involves taking a spinlock, as well as
> > unnecessary work, and greatly reduces scaling of poll(.timeout=0) across
> > multiple threads.
> > 
> > We can query whether the poll will block prior to installing the
> > callback to make the busy-query fast.
> > 
> > Single thread: 60% faster
> > 8 threads on 4 (+4 HT) cores: 600% faster
> > 
> > Still not quite the perfect scaling we get with a native busy ioctl, but
> > poll(dmabuf) is faster due to the quicker lookup of the object and
> > avoiding drm_ioctl().
> > 
> > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> > Cc: Sumit Semwal <sumit.semwal@linaro.org>
> > Cc: linux-media@vger.kernel.org
> > Cc: dri-devel@lists.freedesktop.org
> > Cc: linaro-mm-sig@lists.linaro.org
> > Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>
> 
> Need to strike the r-b here, since Christian König pointed out that
> objects won't magically switch signalling on.

Oh, it also means that

commit fb8b7d2b9d80e1e71f379e57355936bd2b024be9
Author: Jammy Zhou <Jammy.Zhou@amd.com>
Date:   Wed Jan 21 18:35:47 2015 +0800

    reservation: wait only with non-zero timeout specified (v3)
    
    When the timeout value passed to reservation_object_wait_timeout_rcu
    is zero, no wait should be done if the fences are not signaled.
    
    Return '1' for idle and '0' for busy if the specified timeout is '0'
    to keep consistent with the case of non-zero timeout.
    
    v2: call fence_put if not signaled in the case of timeout==0
    
    v3: switch to reservation_object_test_signaled_rcu
    
    Signed-off-by: Jammy Zhou <Jammy.Zhou@amd.com>
    Reviewed-by: Christian König <christian.koenig@amd.com>
    Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
    Reviewed-By: Maarten Lankhorst <maarten.lankhorst@canonical.com>
    Signed-off-by: Sumit Semwal <sumit.semwal@linaro.org>

is wrong. And reservation_object_test_signaled_rcu() is unreliable.
-Chris
Christian König Sept. 23, 2016, 5:59 p.m. UTC | #5
Am 23.09.2016 um 17:20 schrieb Chris Wilson:
> On Fri, Sep 23, 2016 at 03:50:44PM +0200, Daniel Vetter wrote:
>> On Mon, Aug 29, 2016 at 08:08:34AM +0100, Chris Wilson wrote:
>>> Currently we install a callback for performing poll on a dma-buf,
>>> irrespective of the timeout. This involves taking a spinlock, as well as
>>> unnecessary work, and greatly reduces scaling of poll(.timeout=0) across
>>> multiple threads.
>>>
>>> We can query whether the poll will block prior to installing the
>>> callback to make the busy-query fast.
>>>
>>> Single thread: 60% faster
>>> 8 threads on 4 (+4 HT) cores: 600% faster
>>>
>>> Still not quite the perfect scaling we get with a native busy ioctl, but
>>> poll(dmabuf) is faster due to the quicker lookup of the object and
>>> avoiding drm_ioctl().
>>>
>>> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
>>> Cc: Sumit Semwal <sumit.semwal@linaro.org>
>>> Cc: linux-media@vger.kernel.org
>>> Cc: dri-devel@lists.freedesktop.org
>>> Cc: linaro-mm-sig@lists.linaro.org
>>> Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>
>> Need to strike the r-b here, since Christian König pointed out that
>> objects won't magically switch signalling on.
> Oh, it also means that
>
> commit fb8b7d2b9d80e1e71f379e57355936bd2b024be9
> Author: Jammy Zhou <Jammy.Zhou@amd.com>
> Date:   Wed Jan 21 18:35:47 2015 +0800
>
>      reservation: wait only with non-zero timeout specified (v3)
>      
>      When the timeout value passed to reservation_object_wait_timeout_rcu
>      is zero, no wait should be done if the fences are not signaled.
>      
>      Return '1' for idle and '0' for busy if the specified timeout is '0'
>      to keep consistent with the case of non-zero timeout.
>      
>      v2: call fence_put if not signaled in the case of timeout==0
>      
>      v3: switch to reservation_object_test_signaled_rcu
>      
>      Signed-off-by: Jammy Zhou <Jammy.Zhou@amd.com>
>      Reviewed-by: Christian König <christian.koenig@amd.com>
>      Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
>      Reviewed-By: Maarten Lankhorst <maarten.lankhorst@canonical.com>
>      Signed-off-by: Sumit Semwal <sumit.semwal@linaro.org>
>
> is wrong. And reservation_object_test_signaled_rcu() is unreliable.

Ups indeed, that patch is wrong as well.

I suggest that we just enable the signaling in this case as well.

Regards,
Christian.

> -Chris
>

--
To unsubscribe from this list: send the line "unsubscribe linux-media" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Daniel Vetter Sept. 25, 2016, 8:44 p.m. UTC | #6
On Fri, Sep 23, 2016 at 07:59:44PM +0200, Christian König wrote:
> Am 23.09.2016 um 17:20 schrieb Chris Wilson:
> > On Fri, Sep 23, 2016 at 03:50:44PM +0200, Daniel Vetter wrote:
> > > On Mon, Aug 29, 2016 at 08:08:34AM +0100, Chris Wilson wrote:
> > > > Currently we install a callback for performing poll on a dma-buf,
> > > > irrespective of the timeout. This involves taking a spinlock, as well as
> > > > unnecessary work, and greatly reduces scaling of poll(.timeout=0) across
> > > > multiple threads.
> > > > 
> > > > We can query whether the poll will block prior to installing the
> > > > callback to make the busy-query fast.
> > > > 
> > > > Single thread: 60% faster
> > > > 8 threads on 4 (+4 HT) cores: 600% faster
> > > > 
> > > > Still not quite the perfect scaling we get with a native busy ioctl, but
> > > > poll(dmabuf) is faster due to the quicker lookup of the object and
> > > > avoiding drm_ioctl().
> > > > 
> > > > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> > > > Cc: Sumit Semwal <sumit.semwal@linaro.org>
> > > > Cc: linux-media@vger.kernel.org
> > > > Cc: dri-devel@lists.freedesktop.org
> > > > Cc: linaro-mm-sig@lists.linaro.org
> > > > Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>
> > > Need to strike the r-b here, since Christian König pointed out that
> > > objects won't magically switch signalling on.
> > Oh, it also means that
> > 
> > commit fb8b7d2b9d80e1e71f379e57355936bd2b024be9
> > Author: Jammy Zhou <Jammy.Zhou@amd.com>
> > Date:   Wed Jan 21 18:35:47 2015 +0800
> > 
> >      reservation: wait only with non-zero timeout specified (v3)
> >      When the timeout value passed to reservation_object_wait_timeout_rcu
> >      is zero, no wait should be done if the fences are not signaled.
> >      Return '1' for idle and '0' for busy if the specified timeout is '0'
> >      to keep consistent with the case of non-zero timeout.
> >      v2: call fence_put if not signaled in the case of timeout==0
> >      v3: switch to reservation_object_test_signaled_rcu
> >      Signed-off-by: Jammy Zhou <Jammy.Zhou@amd.com>
> >      Reviewed-by: Christian König <christian.koenig@amd.com>
> >      Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
> >      Reviewed-By: Maarten Lankhorst <maarten.lankhorst@canonical.com>
> >      Signed-off-by: Sumit Semwal <sumit.semwal@linaro.org>
> > 
> > is wrong. And reservation_object_test_signaled_rcu() is unreliable.
> 
> Ups indeed, that patch is wrong as well.
> 
> I suggest that we just enable the signaling in this case as well.

Will you/Zhou take care of this corner case? Just so I can't forget about
it ;-)

Thanks, Daniel
diff mbox

Patch

diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
index cf04d249a6a4..c7a7bc579941 100644
--- a/drivers/dma-buf/dma-buf.c
+++ b/drivers/dma-buf/dma-buf.c
@@ -156,6 +156,18 @@  static unsigned int dma_buf_poll(struct file *file, poll_table *poll)
 	if (!events)
 		return 0;
 
+	if (poll_does_not_wait(poll)) {
+		if (events & POLLOUT &&
+		    !reservation_object_test_signaled_rcu(resv, true))
+			events &= ~(POLLOUT | POLLIN);
+
+		if (events & POLLIN &&
+		    !reservation_object_test_signaled_rcu(resv, false))
+			events &= ~POLLIN;
+
+		return events;
+	}
+
 retry:
 	seq = read_seqcount_begin(&resv->seq);
 	rcu_read_lock();