Message ID | 20160828163747.32751-1-chris@chris-wilson.co.uk |
---|---|
State | New, archived |
On Sun, Aug 28, 2016 at 05:37:47PM +0100, Chris Wilson wrote:
> Currently we install a callback for performing poll on a dma-buf,
> irrespective of the timeout. This involves taking a spinlock, as well as
> unnecessary work, and greatly reduces scaling of poll(.timeout=0) across
> multiple threads.
>
> We can query whether the poll will block prior to installing the
> callback to make the busy-query fast.
>
> Single thread: 60% faster
> 8 threads on 4 (+4 HT) cores: 600% faster
>
> Still not quite the perfect scaling we get with a native busy ioctl, but
> poll(dmabuf) is faster due to the quicker lookup of the object and
> avoiding drm_ioctl().
>
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Sumit Semwal <sumit.semwal@linaro.org>
> Cc: linux-media@vger.kernel.org
> Cc: dri-devel@lists.freedesktop.org
> Cc: linaro-mm-sig@lists.linaro.org

Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>

> ---
>  drivers/dma-buf/dma-buf.c | 12 ++++++++++++
>  1 file changed, 12 insertions(+)
>
> diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
> index cf04d249a6a4..c7a7bc579941 100644
> --- a/drivers/dma-buf/dma-buf.c
> +++ b/drivers/dma-buf/dma-buf.c
> @@ -156,6 +156,18 @@ static unsigned int dma_buf_poll(struct file *file, poll_table *poll)
>  	if (!events)
>  		return 0;
>  
> +	if (poll_does_not_wait(poll)) {
> +		if (events & POLLOUT &&
> +		    !reservation_object_test_signaled_rcu(resv, true))
> +			events &= ~(POLLOUT | POLLIN);
> +
> +		if (events & POLLIN &&
> +		    !reservation_object_test_signaled_rcu(resv, false))
> +			events &= ~POLLIN;
> +
> +		return events;
> +	}
> +
>  retry:
>  	seq = read_seqcount_begin(&resv->seq);
>  	rcu_read_lock();
> -- 
> 2.9.3
>
> _______________________________________________
> Intel-gfx mailing list
> Intel-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/intel-gfx
On Sun, Aug 28, 2016 at 05:37:47PM +0100, Chris Wilson wrote:
> Currently we install a callback for performing poll on a dma-buf,
> irrespective of the timeout. This involves taking a spinlock, as well as
> unnecessary work, and greatly reduces scaling of poll(.timeout=0) across
> multiple threads.
>
> We can query whether the poll will block prior to installing the
> callback to make the busy-query fast.
>
> Single thread: 60% faster
> 8 threads on 4 (+4 HT) cores: 600% faster

Hmm, this only really applies to the idle case.
reservation_object_test_signaled_rcu() is still a major bottleneck when
busy, due to the dance inside reservation_object_test_signaled_single()
-Chris
On Sun, Aug 28, 2016 at 09:33:54PM +0100, Chris Wilson wrote:
> On Sun, Aug 28, 2016 at 05:37:47PM +0100, Chris Wilson wrote:
> > Currently we install a callback for performing poll on a dma-buf,
> > irrespective of the timeout. This involves taking a spinlock, as well as
> > unnecessary work, and greatly reduces scaling of poll(.timeout=0) across
> > multiple threads.
> >
> > We can query whether the poll will block prior to installing the
> > callback to make the busy-query fast.
> >
> > Single thread: 60% faster
> > 8 threads on 4 (+4 HT) cores: 600% faster
>
> Hmm, this only really applies to the idle case.
> reservation_object_test_signaled_rcu() is still a major bottleneck when
> busy, due to the dance inside reservation_object_test_signaled_single()

The fix is not difficult; it just requires extending the seqlock to catch
the RCU race (i.e. earlier patches). I'll resend that series in the morning.
-Chris
diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
index cf04d249a6a4..c7a7bc579941 100644
--- a/drivers/dma-buf/dma-buf.c
+++ b/drivers/dma-buf/dma-buf.c
@@ -156,6 +156,18 @@ static unsigned int dma_buf_poll(struct file *file, poll_table *poll)
 	if (!events)
 		return 0;
 
+	if (poll_does_not_wait(poll)) {
+		if (events & POLLOUT &&
+		    !reservation_object_test_signaled_rcu(resv, true))
+			events &= ~(POLLOUT | POLLIN);
+
+		if (events & POLLIN &&
+		    !reservation_object_test_signaled_rcu(resv, false))
+			events &= ~POLLIN;
+
+		return events;
+	}
+
 retry:
 	seq = read_seqcount_begin(&resv->seq);
 	rcu_read_lock();
Currently we install a callback for performing poll on a dma-buf,
irrespective of the timeout. This involves taking a spinlock, as well as
unnecessary work, and greatly reduces scaling of poll(.timeout=0) across
multiple threads.

We can query whether the poll will block prior to installing the
callback to make the busy-query fast.

Single thread: 60% faster
8 threads on 4 (+4 HT) cores: 600% faster

Still not quite the perfect scaling we get with a native busy ioctl, but
poll(dmabuf) is faster due to the quicker lookup of the object and
avoiding drm_ioctl().

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Sumit Semwal <sumit.semwal@linaro.org>
Cc: linux-media@vger.kernel.org
Cc: dri-devel@lists.freedesktop.org
Cc: linaro-mm-sig@lists.linaro.org
---
 drivers/dma-buf/dma-buf.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)