
[2/4] drm/cache: Try to be smarter about clflushing on x86

Message ID 1418526504-26316-3-git-send-email-benjamin.widawsky@intel.com (mailing list archive)
State New, archived

Commit Message

Ben Widawsky Dec. 14, 2014, 3:08 a.m. UTC
Any GEM driver which has very large objects and a slow CPU is subject to very
long waits simply for clflushing incoherent objects. Generally, each individual
object is not a problem, but if you have very large objects, or very many
objects, the flushing begins to show up in profiles. Because on x86 we know the
cache size, we can easily determine when an object will use all the cache, and
forgo iterating over each cacheline.

We need to be careful when using wbinvd. wbinvd() is itself potentially slow
because it requires synchronizing the flush across all CPUs so they have a
coherent view of memory. This can result in either stalling work being done on
other CPUs, or this call itself stalling while waiting for a CPU to accept the
interrupt. wbinvd() also has the downside of invalidating all cachelines, so we
don't want to use it unless we're sure we already own most of the cachelines.

The current algorithm is very naive. I think it can be tweaked further, and it
would be good if someone else gave it some thought. I am pretty confident that
in i915 we can even skip the IPI in the execbuf path with a minimal code change
(or perhaps just some verification of the existing code). It would be nice to
hear what other developers who depend on this code think.

Cc: Intel GFX <intel-gfx@lists.freedesktop.org>
Signed-off-by: Ben Widawsky <ben@bwidawsk.net>
---
 drivers/gpu/drm/drm_cache.c | 20 +++++++++++++++++---
 1 file changed, 17 insertions(+), 3 deletions(-)
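
As a rough sense of scale for the heuristic (assuming 64-byte cache lines, 4 KB
pages, and an 8 MB LLC, i.e. x86_cache_size = 8192): clflushing a 128 MB object
one line at a time is 128 * 1024 * 1024 / 64 = ~2 million clflush instructions,
while 8192 >> 2 = 2048 pages means any object of 8 MB or larger would instead
take the single-wbinvd path.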

Comments

Matt Turner Dec. 14, 2014, 4:15 a.m. UTC | #1
On Sat, Dec 13, 2014 at 7:08 PM, Ben Widawsky
<benjamin.widawsky@intel.com> wrote:
> Any GEM driver which has very large objects and a slow CPU is subject to very
> long waits simply for clflushing incoherent objects. Generally, each individual
> object is not a problem, but if you have very large objects, or very many
> objects, the flushing begins to show up in profiles. Because on x86 we know the
> cache size, we can easily determine when an object will use all the cache, and
> forego iterating over each cacheline.
>
> We need to be careful when using wbinvd. wbinvd() is itself potentially slow
> because it requires synchronizing the flush across all CPUs so they have a
> coherent view of memory. This can result in either stalling work being done on
> other CPUs, or this call itself stalling while waiting for a CPU to accept the
> interrupt. Also, wbinvd() also has the downside of invalidating all cachelines,
> so we don't want to use it unless we're sure we already own most of the
> cachelines.
>
> The current algorithm is very naive. I think it can be tweaked more, and it
> would be good if someone else gave it some thought. I am pretty confident in
> i915, we can even skip the IPI in the execbuf path with minimal code change (or
> perhaps just some verifying of the existing code). It would be nice to hear what
> other developers who depend on this code think.
>
> Cc: Intel GFX <intel-gfx@lists.freedesktop.org>
> Signed-off-by: Ben Widawsky <ben@bwidawsk.net>
> ---
>  drivers/gpu/drm/drm_cache.c | 20 +++++++++++++++++---
>  1 file changed, 17 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/gpu/drm/drm_cache.c b/drivers/gpu/drm/drm_cache.c
> index d7797e8..6009c2d 100644
> --- a/drivers/gpu/drm/drm_cache.c
> +++ b/drivers/gpu/drm/drm_cache.c
> @@ -64,6 +64,20 @@ static void drm_cache_flush_clflush(struct page *pages[],
>                 drm_clflush_page(*pages++);
>         mb();
>  }
> +
> +static bool
> +drm_cache_should_clflush(unsigned long num_pages)
> +{
> +       const int cache_size = boot_cpu_data.x86_cache_size;
> +
> +       /* For now the algorithm simply checks if the number of pages to be
> +        * flushed is greater than the entire system cache. One could make the
> +        * function more aware of the actual system (ie. if SMP, how large is
> +        * the cache, CPU freq. etc. All those help to determine when to
> +        * wbinvd() */
> +       WARN_ON_ONCE(!cache_size);
> +       return !cache_size || num_pages < (cache_size >> 2);
> +}
>  #endif
>
>  void
> @@ -71,7 +85,7 @@ drm_clflush_pages(struct page *pages[], unsigned long num_pages)
>  {
>
>  #if defined(CONFIG_X86)
> -       if (cpu_has_clflush) {
> +       if (cpu_has_clflush && drm_cache_should_clflush(num_pages)) {
>                 drm_cache_flush_clflush(pages, num_pages);
>                 return;
>         }
> @@ -104,7 +118,7 @@ void
>  drm_clflush_sg(struct sg_table *st)
>  {
>  #if defined(CONFIG_X86)
> -       if (cpu_has_clflush) {
> +       if (cpu_has_clflush && drm_cache_should_clflush(st->nents)) {
>                 struct sg_page_iter sg_iter;
>
>                 mb();
> @@ -128,7 +142,7 @@ void
>  drm_clflush_virt_range(void *addr, unsigned long length)
>  {
>  #if defined(CONFIG_X86)
> -       if (cpu_has_clflush) {
> +       if (cpu_has_clflush && drm_cache_should_clflush(length / PAGE_SIZE)) {

If length isn't a multiple of page size, isn't this ignoring the
remainder? Should it be rounding length up to the next multiple of
PAGE_SIZE, like ROUND_UP_TO?
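
One way that could look, as a rough sketch rather than the posted code
(DIV_ROUND_UP from linux/kernel.h is the usual helper for this):

	if (cpu_has_clflush &&
	    drm_cache_should_clflush(DIV_ROUND_UP(length, PAGE_SIZE))) {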
Chris Wilson Dec. 14, 2014, 12:59 p.m. UTC | #2
On Sat, Dec 13, 2014 at 07:08:22PM -0800, Ben Widawsky wrote:
> Any GEM driver which has very large objects and a slow CPU is subject to very
> long waits simply for clflushing incoherent objects. Generally, each individual
> object is not a problem, but if you have very large objects, or very many
> objects, the flushing begins to show up in profiles. Because on x86 we know the
> cache size, we can easily determine when an object will use all the cache, and
> forego iterating over each cacheline.
> 
> We need to be careful when using wbinvd. wbinvd() is itself potentially slow
> because it requires synchronizing the flush across all CPUs so they have a
> coherent view of memory. This can result in either stalling work being done on
> other CPUs, or this call itself stalling while waiting for a CPU to accept the
> interrupt. Also, wbinvd() also has the downside of invalidating all cachelines,
> so we don't want to use it unless we're sure we already own most of the
> cachelines.
> 
> The current algorithm is very naive. I think it can be tweaked more, and it
> would be good if someone else gave it some thought. I am pretty confident in
> i915, we can even skip the IPI in the execbuf path with minimal code change (or
> perhaps just some verifying of the existing code). It would be nice to hear what
> other developers who depend on this code think.

One of the things wbinvd is considered evil for is that it blocks the
CPU for an indeterminate amount of time - upsetting latency-critical
aspects of the OS. For example, x86/mm has similar code that uses
wbinvd for large flushes, which caused a bit of controversy with RT:

http://linux-kernel.2935.n7.nabble.com/PATCH-x86-Use-clflush-instead-of-wbinvd-whenever-possible-when-changing-mapping-td493751.html

and the use of wbinvd in the nvidia driver has also been noted as
evil by RT folks.

However as the wbinvd still exists, it can't be all that bad...

> Cc: Intel GFX <intel-gfx@lists.freedesktop.org>
> Signed-off-by: Ben Widawsky <ben@bwidawsk.net>
> ---
>  drivers/gpu/drm/drm_cache.c | 20 +++++++++++++++++---
>  1 file changed, 17 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/gpu/drm/drm_cache.c b/drivers/gpu/drm/drm_cache.c
> index d7797e8..6009c2d 100644
> --- a/drivers/gpu/drm/drm_cache.c
> +++ b/drivers/gpu/drm/drm_cache.c
> @@ -64,6 +64,20 @@ static void drm_cache_flush_clflush(struct page *pages[],
>  		drm_clflush_page(*pages++);
>  	mb();
>  }
> +
> +static bool
> +drm_cache_should_clflush(unsigned long num_pages)
> +{
> +	const int cache_size = boot_cpu_data.x86_cache_size;

Maybe const int cpu_cache_size = boot_cpu_data.x86_cache_size >> 2; /* in pages */

Just to make it clear where the factor of 4 is required.
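
Spelled out that way, as a sketch only (x86_cache_size is reported in KB and
x86 pages are 4 KB, so the shift converts the LLC size into pages):

	static bool
	drm_cache_should_clflush(unsigned long num_pages)
	{
		/* boot_cpu_data.x86_cache_size is in KB; >> 2 turns that into
		 * 4 KB pages, e.g. an 8192 KB LLC gives a 2048-page (8 MB)
		 * threshold. */
		const int cpu_cache_size = boot_cpu_data.x86_cache_size >> 2; /* in pages */

		WARN_ON_ONCE(!cpu_cache_size);
		return !cpu_cache_size || num_pages < cpu_cache_size;
	}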

How standalone is this patch? What happens if you just apply it by
itself? I presume it wasn't all that effective on its own, since you needed
the additional patches to prevent superfluous flushes. It should still
reduce the time it takes to bind framebuffers, etc., but overall I expect
it to worsen performance as we do the repeated wbinvd in execbuffer.
-Chris
Jesse Barnes Dec. 15, 2014, 4:06 a.m. UTC | #3
On 12/14/2014 4:59 AM, Chris Wilson wrote:
> One of the things wbinvd is considered evil for is that it blocks the
> CPU for an indeterminate amount of time - upsetting latency critcial
> aspects of the OS. For example, the x86/mm has similar code to use
> wbinvd for large clflushes that caused a bit of controversy with RT:
>
> http://linux-kernel.2935.n7.nabble.com/PATCH-x86-Use-clflush-instead-of-wbinvd-whenever-possible-when-changing-mapping-td493751.html
>
> and also the use of wbinvd in the nvidia driver has also been noted as
> evil by RT folks.
>
> However as the wbinvd still exists, it can't be all that bad...

Yeah there are definitely tradeoffs here.  In this particular case, 
we're trying to flush out a ~140M object on every frame, which just 
seems silly.

There's definitely room for optimization in Mesa too; avoiding a mapping 
that marks a large bo as dirty would be good, but if we improve the 
kernel in this area, even sloppy apps and existing binaries will speed up.

Maybe we could apply this only on !llc systems or something?  I wonder 
how much wbinvd performance varies across microarchitectures; maybe 
Thomas's issue isn't really relevant anymore (at least one can hope).

When digging into this, I found that an optimization to remove the IPI 
for wbinvd was clobbered during a merge; maybe that should be 
resurrected too.  Surely a single, global wbinvd is sufficient; we don't 
need to do n_cpus^2 wbinvd + the associated invalidation bus signals here...
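
For reference, the kernel already has both the local and the broadcast form, so
the question is mostly which one this path ends up using (existing x86 helpers,
shown here as a sketch; header locations may vary by kernel version):

	#include <asm/special_insns.h>	/* wbinvd()             */
	#include <asm/smp.h>		/* wbinvd_on_all_cpus() */

	wbinvd();		/* write back + invalidate this CPU's caches only */
	wbinvd_on_all_cpus();	/* run wbinvd on every online CPU (IPIs the rest) */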

Alternately, we could insert some delays into this path just to make it 
extra clear to userspace that they really shouldn't be hitting this in 
the common case (and provide some additional interfaces to let them 
avoid it by allowing flushing and dirty management in userspace).

Jesse
Ben Widawsky Dec. 15, 2014, 7:54 p.m. UTC | #4
On Sun, Dec 14, 2014 at 08:06:20PM -0800, Jesse Barnes wrote:
> On 12/14/2014 4:59 AM, Chris Wilson wrote:
> >One of the things wbinvd is considered evil for is that it blocks the
> >CPU for an indeterminate amount of time - upsetting latency critcial
> >aspects of the OS. For example, the x86/mm has similar code to use
> >wbinvd for large clflushes that caused a bit of controversy with RT:
> >
> >http://linux-kernel.2935.n7.nabble.com/PATCH-x86-Use-clflush-instead-of-wbinvd-whenever-possible-when-changing-mapping-td493751.html
> >
> >and also the use of wbinvd in the nvidia driver has also been noted as
> >evil by RT folks.
> >
> >However as the wbinvd still exists, it can't be all that bad...

That patch looks eerily similar. I guess mine is slightly better in that it
takes the cache size into account.

> 
> Yeah there are definitely tradeoffs here.  In this particular case, we're
> trying to flush out a ~140M object on every frame, which just seems silly.
> 
> There's definitely room for optimization in Mesa too; avoiding a mapping
> that marks a large bo as dirty would be good, but if we improve the kernel
> in this area, even sloppy apps and existing binaries will speed up.
> 
> Maybe we could apply this only on !llc systems or something?  I wonder how
> much wbinvd performance varies across microarchitectures; maybe Thomas's
> issue isn't really relevant anymore (at least one can hope).

I am pretty sure that from the Mesa perspective we do not hit this path on
non-LLC systems, because we end up with a linear buffer and just CPU-map it. In
addition to trying to manage cache dirtiness ourselves, the current code that
decides how to handle subregions within large textures could probably use some
eyes, to make sure the order in which we try the various fallbacks makes sense
for all sizes and all platforms. I /think/ we can do better, but when I was
writing a patch the code got messy fast.

> 
> When digging into this, I found that an optimization to remove the IPI for
> wbinvd was clobbered during a merge; maybe that should be resurrected too.
> Surely a single, global wbinvd is sufficient; we don't need to do n_cpus^2
> wbinvd + the associated invalidation bus signals here...

If this actually works, then there should be no CPU stall at all.

> 
> Alternately, we could insert some delays into this path just to make it
> extra clear to userspace that they really shouldn't be hitting this in the
> common case (and provide some additional interfaces to let them avoid it by
> allowing flushing and dirty management in userspace).

I don't think such an extreme step is needed. If we really don't need the IPI,
then I think this path can only be faster than CLFLUSH.

> 
> Jesse
Ben Widawsky Dec. 15, 2014, 10:07 p.m. UTC | #5
On Sat, Dec 13, 2014 at 08:15:22PM -0800, Matt Turner wrote:
> On Sat, Dec 13, 2014 at 7:08 PM, Ben Widawsky
> <benjamin.widawsky@intel.com> wrote:
> > Any GEM driver which has very large objects and a slow CPU is subject to very
> > long waits simply for clflushing incoherent objects. Generally, each individual
> > object is not a problem, but if you have very large objects, or very many
> > objects, the flushing begins to show up in profiles. Because on x86 we know the
> > cache size, we can easily determine when an object will use all the cache, and
> > forego iterating over each cacheline.
> >
> > We need to be careful when using wbinvd. wbinvd() is itself potentially slow
> > because it requires synchronizing the flush across all CPUs so they have a
> > coherent view of memory. This can result in either stalling work being done on
> > other CPUs, or this call itself stalling while waiting for a CPU to accept the
> > interrupt. Also, wbinvd() also has the downside of invalidating all cachelines,
> > so we don't want to use it unless we're sure we already own most of the
> > cachelines.
> >
> > The current algorithm is very naive. I think it can be tweaked more, and it
> > would be good if someone else gave it some thought. I am pretty confident in
> > i915, we can even skip the IPI in the execbuf path with minimal code change (or
> > perhaps just some verifying of the existing code). It would be nice to hear what
> > other developers who depend on this code think.
> >
> > Cc: Intel GFX <intel-gfx@lists.freedesktop.org>
> > Signed-off-by: Ben Widawsky <ben@bwidawsk.net>
> > ---
> >  drivers/gpu/drm/drm_cache.c | 20 +++++++++++++++++---
> >  1 file changed, 17 insertions(+), 3 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/drm_cache.c b/drivers/gpu/drm/drm_cache.c
> > index d7797e8..6009c2d 100644
> > --- a/drivers/gpu/drm/drm_cache.c
> > +++ b/drivers/gpu/drm/drm_cache.c
> > @@ -64,6 +64,20 @@ static void drm_cache_flush_clflush(struct page *pages[],
> >                 drm_clflush_page(*pages++);
> >         mb();
> >  }
> > +
> > +static bool
> > +drm_cache_should_clflush(unsigned long num_pages)
> > +{
> > +       const int cache_size = boot_cpu_data.x86_cache_size;
> > +
> > +       /* For now the algorithm simply checks if the number of pages to be
> > +        * flushed is greater than the entire system cache. One could make the
> > +        * function more aware of the actual system (ie. if SMP, how large is
> > +        * the cache, CPU freq. etc. All those help to determine when to
> > +        * wbinvd() */
> > +       WARN_ON_ONCE(!cache_size);
> > +       return !cache_size || num_pages < (cache_size >> 2);
> > +}
> >  #endif
> >
> >  void
> > @@ -71,7 +85,7 @@ drm_clflush_pages(struct page *pages[], unsigned long num_pages)
> >  {
> >
> >  #if defined(CONFIG_X86)
> > -       if (cpu_has_clflush) {
> > +       if (cpu_has_clflush && drm_cache_should_clflush(num_pages)) {
> >                 drm_cache_flush_clflush(pages, num_pages);
> >                 return;
> >         }
> > @@ -104,7 +118,7 @@ void
> >  drm_clflush_sg(struct sg_table *st)
> >  {
> >  #if defined(CONFIG_X86)
> > -       if (cpu_has_clflush) {
> > +       if (cpu_has_clflush && drm_cache_should_clflush(st->nents)) {
> >                 struct sg_page_iter sg_iter;
> >
> >                 mb();
> > @@ -128,7 +142,7 @@ void
> >  drm_clflush_virt_range(void *addr, unsigned long length)
> >  {
> >  #if defined(CONFIG_X86)
> > -       if (cpu_has_clflush) {
> > +       if (cpu_has_clflush && drm_cache_should_clflush(length / PAGE_SIZE)) {
> 
> If length isn't a multiple of page size, isn't this ignoring the
> remainder? Should it be rounding length up to the next multiple of
> PAGE_SIZE, like ROUND_UP_TO?

Yeah, we could round up. In practice it probably won't matter. I actually think
it would be better to pass a size to drm_cache_should_clflush() and let it do
the rounding. It sounds like people don't want this patch anyway, so I'll make
the equivalent change in the i915-only patch.
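
A rough sketch of that approach (untested, and not part of this series; callers
would pass a byte size and shift page counts up by PAGE_SHIFT):

	static bool
	drm_cache_should_clflush(unsigned long length)
	{
		const int cache_size = boot_cpu_data.x86_cache_size;	/* in KB */
		const unsigned long num_pages = DIV_ROUND_UP(length, PAGE_SIZE);

		WARN_ON_ONCE(!cache_size);
		return !cache_size || num_pages < (cache_size >> 2);
	}

so drm_clflush_virt_range() would pass length directly, while drm_clflush_pages()
would pass num_pages << PAGE_SHIFT.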

Patch

diff --git a/drivers/gpu/drm/drm_cache.c b/drivers/gpu/drm/drm_cache.c
index d7797e8..6009c2d 100644
--- a/drivers/gpu/drm/drm_cache.c
+++ b/drivers/gpu/drm/drm_cache.c
@@ -64,6 +64,20 @@  static void drm_cache_flush_clflush(struct page *pages[],
 		drm_clflush_page(*pages++);
 	mb();
 }
+
+static bool
+drm_cache_should_clflush(unsigned long num_pages)
+{
+	const int cache_size = boot_cpu_data.x86_cache_size;
+
+	/* For now the algorithm simply checks if the number of pages to be
+	 * flushed is greater than the entire system cache. One could make the
+	 * function more aware of the actual system (ie. if SMP, how large is
+	 * the cache, CPU freq. etc. All those help to determine when to
+	 * wbinvd() */
+	WARN_ON_ONCE(!cache_size);
+	return !cache_size || num_pages < (cache_size >> 2);
+}
 #endif
 
 void
@@ -71,7 +85,7 @@  drm_clflush_pages(struct page *pages[], unsigned long num_pages)
 {
 
 #if defined(CONFIG_X86)
-	if (cpu_has_clflush) {
+	if (cpu_has_clflush && drm_cache_should_clflush(num_pages)) {
 		drm_cache_flush_clflush(pages, num_pages);
 		return;
 	}
@@ -104,7 +118,7 @@  void
 drm_clflush_sg(struct sg_table *st)
 {
 #if defined(CONFIG_X86)
-	if (cpu_has_clflush) {
+	if (cpu_has_clflush && drm_cache_should_clflush(st->nents)) {
 		struct sg_page_iter sg_iter;
 
 		mb();
@@ -128,7 +142,7 @@  void
 drm_clflush_virt_range(void *addr, unsigned long length)
 {
 #if defined(CONFIG_X86)
-	if (cpu_has_clflush) {
+	if (cpu_has_clflush && drm_cache_should_clflush(length / PAGE_SIZE)) {
 		void *end = addr + length;
 		mb();
 		for (; addr < end; addr += boot_cpu_data.x86_clflush_size)