x86: Add an explicit barrier() to clflushopt()

Message ID	20160111112801.GR652@nuc-i3427.alporthouse.com (mailing list archive)

Chris Wilson Jan. 11, 2016, 11:28 a.m. UTC

On Sat, Jan 09, 2016 at 02:36:03PM -0800, Andy Lutomirski wrote:
> On Sat, Jan 9, 2016 at 12:01 AM, Chris Wilson <chris@chris-wilson.co.uk> wrote:
> > On Thu, Jan 07, 2016 at 02:32:23PM -0800, H. Peter Anvin wrote:
> >> On 01/07/16 14:29, H. Peter Anvin wrote:
> >> >
> >> > I would be very interested in knowing if replacing the final clflushopt
> >> > with a clflush would resolve your problems (in which case the last mb()
> >> > shouldn't be necessary either.)
> >> >
> >>
> >> Nevermind.  CLFLUSH is not ordered with regards to CLFLUSHOPT to the
> >> same cache line.
> >>
> >> Could you add a sync_cpu(); call to the end (can replace the final mb())
> >> and see if that helps your case?
> >
> > s/sync_cpu()/sync_core()/
> >
> > No. I still see failures on Baytrail and Braswell (Pineview is not
> > affected) with the final mb() replaced with sync_core(). I can reproduce
> > failures on Pineview by tweaking the clflush_cache_range() parameters,
> > so I am fairly confident that it is validating the current code.
> >
> > iirc sync_core() is cpuid, a heavy serialising instruction, an
> > alternative to mfence.  Is there anything else that I can infer about
> > the nature of my bug from this result?
> 
> No clue, but I don't know much about the underlying architecture.
> 
> Can you try clflush_cache_ranging one cacheline less and then manually
> doing clflushopt; mb on the last cache line, just to make sure that
> the helper is really doing the right thing?  You could also try
> clflush instead of clflushopt to see if that makes a difference.

I had looked at increasing the range over which clflush_cache_range()
runs (using roundup/rounddown by cache lines), but it took something
like +/- 256 bytes to pass all the tests. And also did
s/clflushopt/clflush/ to confirm that made no difference.

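Bizarrely,

diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c
index 6000ad7..cf074400 100644
--- a/arch/x86/mm/pageattr.c
+++ b/arch/x86/mm/pageattr.c
@@ -141,6 +141,7 @@ void clflush_cache_range(void *vaddr, unsigned int size)
        for (; p < vend; p += clflush_size)
                clflushopt(p);
 
+       clflushopt(vend-1);
        mb();
 }
 EXPORT_SYMBOL_GPL(clflush_cache_range);

works like a charm.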
-Chris

Linus Torvalds Jan. 11, 2016, 8:11 p.m. UTC | #1

On Mon, Jan 11, 2016 at 3:28 AM, Chris Wilson <chris@chris-wilson.co.uk> wrote:
>
> Bizarrely,
>
> diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c
> index 6000ad7..cf074400 100644
> --- a/arch/x86/mm/pageattr.c
> +++ b/arch/x86/mm/pageattr.c
> @@ -141,6 +141,7 @@ void clflush_cache_range(void *vaddr, unsigned int size)
>         for (; p < vend; p += clflush_size)
>                 clflushopt(p);
>
> +       clflushopt(vend-1);
>         mb();
>  }
>  EXPORT_SYMBOL_GPL(clflush_cache_range);
>
> works like a charm.

Have you checked all your callers? If the above makes a difference, it
really sounds like the caller has passed in a size of zero, resulting
in no cache flush, because the caller had incorrect ranges. The
additional clflushopt now flushes the previous cacheline that wasn't
flushed correctly before.

That "size was zero" thing would explain why changing the loop to "p
<= vend" also fixes things for you.

IOW, just how sure are you that all the ranges are correct?

                  Linus
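
The zero-size theory above can be checked with plain arithmetic. The following is a minimal userspace sketch, not kernel code: it executes no cache instruction and only computes which line addresses the loop in clflush_cache_range() would pass to clflushopt(), plus the line hit by the added clflushopt(vend-1). The names flush_plan and plan_flush are hypothetical, and the 64-byte line size used in the example is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model: records what the kernel loop would flush. */
struct flush_plan {
    uintptr_t first;  /* first line address the loop visits (meaningful if count > 0) */
    size_t count;     /* number of lines flushed by the loop */
    uintptr_t extra;  /* line containing vend - 1, target of the added flush */
};

static struct flush_plan plan_flush(uintptr_t vaddr, size_t size, size_t line)
{
    struct flush_plan plan;
    uintptr_t vend = vaddr + size;
    uintptr_t p;

    /* The masking below only works for a power-of-two line size. */
    assert(line != 0 && (line & (line - 1)) == 0);

    p = vaddr & ~(uintptr_t)(line - 1);   /* round down to a line boundary */
    plan.first = p;
    plan.count = 0;
    for (; p < vend; p += line)           /* same bounds as the kernel loop */
        plan.count++;

    plan.extra = (vend - 1) & ~(uintptr_t)(line - 1);
    return plan;
}
```

With a line-aligned vaddr of 0x1000 and size 0, the loop flushes nothing and the extra flush lands on the previous line at 0xfc0, which is the case described above. With vaddr 0x1010 and size 100, the loop already covers the lines at 0x1000 and 0x1040, so the extra flush only repeats the last one.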

Chris Wilson Jan. 11, 2016, 9:05 p.m. UTC | #2

On Mon, Jan 11, 2016 at 12:11:05PM -0800, Linus Torvalds wrote:
> On Mon, Jan 11, 2016 at 3:28 AM, Chris Wilson <chris@chris-wilson.co.uk> wrote:
> >
> > Bizarrely,
> >
> > diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c
> > index 6000ad7..cf074400 100644
> > --- a/arch/x86/mm/pageattr.c
> > +++ b/arch/x86/mm/pageattr.c
> > @@ -141,6 +141,7 @@ void clflush_cache_range(void *vaddr, unsigned int size)
> >         for (; p < vend; p += clflush_size)
> >                 clflushopt(p);
> >
> > +       clflushopt(vend-1);
> >         mb();
> >  }
> >  EXPORT_SYMBOL_GPL(clflush_cache_range);
> >
> > works like a charm.
> 
> Have you checked all your callers? If the above makes a difference, it
> really sounds like the caller has passed in a size of zero, resulting
> in no cache flush, because the caller had incorrect ranges. The
> additional clflushopt now flushes the previous cacheline that wasn't
> flushed correctly before.
> 
> That "size was zero" thing would explain why changing the loop to "p
> <= vend" also fixes things for you.

This is on top of HPA's suggestion to do the size==0 check up front.

> IOW, just how sure are you that all the ranges are correct?

All our callers are of the pattern:

memcpy(dst, vaddr, size)
clflush_cache_range(dst, size)

or 

clflush_cache_range(vaddr, size)
memcpy(dst, vaddr, size)

I am reasonably confident that the ranges are sane. I've tried to verify
that we do the clflushes by forcing them. However, if I clflush the
whole object instead of the cachelines around the copies, the tests pass.
(Flushing up to a couple of megabytes instead of a few hundred bytes, it
is hard to draw any conclusions about what the bug might be.)

I can narrow down the principal buggy path by doing the clflush(vend-1)
in the callers at least.

The problem is that the tests that fail are those looking for bugs in
the coherency code, which may just as well be caused by the GPU writing
into those ranges at the same time as the CPU trying to read them. I've
looked into timing and tried adding udelay()s or uncached mmio along the
suspect paths, but that didn't change the presentation - having a udelay
fix the issue is usually a good indicator of a GPU write that hasn't landed
before the CPU read.

The bug only affects a couple of recent non-coherent platforms, earlier
Atoms and older Core seem unaffected. That may also mean that it is the
GPU flush instruction that changed between platforms and isn't working
(as we intended at least).

Thanks for everyone's help and suggestions,
-Chris

Chris Wilson Jan. 12, 2016, 4:37 p.m. UTC | #3

On Mon, Jan 11, 2016 at 09:05:06PM +0000, Chris Wilson wrote:
> I can narrow down the principal buggy path by doing the clflush(vend-1)
> in the callers at least.

That leads to the suspect path being a read back of a cache line from
main memory that was just written to by the GPU. Writes to memory before
using them on the GPU do not seem to be affected (or at least we have
sufficient flushing in sending the commands to the GPU that we don't
notice anything wrong).

And back to the oddity.

Instead of doing:

	clflush_cache_range(vaddr + offset, size);
	clflush(vaddr+offset+size-1);
	mb();
	memcpy(user, vaddr+offset, size);

what also worked was:

	clflush_cache_range(vaddr + offset, size);
	clflush(vaddr);
	mb();
	memcpy(user, vaddr+offset, size);

(size is definitely non-zero, offset is offset_in_page(), vaddr is from
kmap_atomic()).

i.e.

void clflush_cache_range(void *vaddr, unsigned int size)
{
        const unsigned long clflush_size = boot_cpu_data.x86_clflush_size;
        void *p = (void *)((unsigned long)vaddr & ~(clflush_size - 1));
        void *vend = vaddr + size;

        if (p >= vend)
                return;

        mb();

        for (; p < vend; p += clflush_size)
                clflushopt(p);

        clflushopt(vaddr);

        mb();
}

I have also confirmed that this doesn't just happen for single
cachelines (i.e. where the earlier clflush(vend-1) and this clflush(vaddr)
would be equivalent).

At the moment I am more inclined to think this is serialising the clflush()
(since clflush to the same cacheline is regarded as ordered with respect
to the earlier clflush iirc) as opposed to the writes not landing timely
from the GPU.

Am I completely going mad?
-Chris

Linus Torvalds Jan. 12, 2016, 5:05 p.m. UTC | #4

On Tue, Jan 12, 2016 at 8:37 AM, Chris Wilson <chris@chris-wilson.co.uk> wrote:
> On Mon, Jan 11, 2016 at 09:05:06PM +0000, Chris Wilson wrote:
>> I can narrow down the principal buggy path by doing the clflush(vend-1)
>> in the callers at least.
>
> That leads to the suspect path being a read back of a cache line from
> main memory that was just written to by the GPU.

How do you know it was written by the GPU?

Maybe it's a memory ordering issue on the GPU. Say it writes something
to memory, then sets the "I'm done" flag (or whatever you check), but
because of ordering on the GPU the "I'm done" flag is visible before.

So the reason you see the old content may just be that the GPU writes
are still buffered on the GPU. And you adding a clflushopt on the same
address just changes the timing enough that you don't see the memory
ordering any more (or it's just much harder to see, it might still be
there).

Maybe the reason you only see the problem with the last cacheline is
simply that the "last" cacheline is also the last that was written by
the GPU, and it's still in the GPU write buffers.

Also, did you ever print out the value of clflush_size? Maybe we just
got it wrong and it's bogus data.

                    Linus

H. Peter Anvin Jan. 12, 2016, 5:17 p.m. UTC | #5

On January 11, 2016 3:28:01 AM PST, Chris Wilson <chris@chris-wilson.co.uk> wrote:
>On Sat, Jan 09, 2016 at 02:36:03PM -0800, Andy Lutomirski wrote:
>> On Sat, Jan 9, 2016 at 12:01 AM, Chris Wilson
><chris@chris-wilson.co.uk> wrote:
>> > On Thu, Jan 07, 2016 at 02:32:23PM -0800, H. Peter Anvin wrote:
>> >> On 01/07/16 14:29, H. Peter Anvin wrote:
>> >> >
>> >> > I would be very interested in knowing if replacing the final
>clflushopt
>> >> > with a clflush would resolve your problems (in which case the
>last mb()
>> >> > shouldn't be necessary either.)
>> >> >
>> >>
>> >> Nevermind.  CLFLUSH is not ordered with regards to CLFLUSHOPT to
>the
>> >> same cache line.
>> >>
>> >> Could you add a sync_cpu(); call to the end (can replace the final
>mb())
>> >> and see if that helps your case?
>> >
>> > s/sync_cpu()/sync_core()/
>> >
>> > No. I still see failures on Baytrail and Braswell (Pineview is not
>> > affected) with the final mb() replaced with sync_core(). I can
>reproduce
>> > failures on Pineview by tweaking the clflush_cache_range()
>parameters,
>> > so I am fairly confident that it is validating the current code.
>> >
>> > iirc sync_core() is cpuid, a heavy serialising instruction, an
>> > alternative to mfence.  Is there anything else that I can infer
>about
>> > the nature of my bug from this result?
>> 
>> No clue, but I don't know much about the underlying architecture.
>> 
>> Can you try clflush_cache_ranging one cacheline less and then
>manually
>> doing clflushopt; mb on the last cache line, just to make sure that
>> the helper is really doing the right thing?  You could also try
>> clflush instead of clflushopt to see if that makes a difference.
>
>I had looked at increasing the range over which clflush_cache_range()
>runs (using roundup/rounddown by cache lines), but it took something
>like +/- 256 bytes to pass all the tests. And also did
>s/clflushopt/clflush/ to confirm that made no difference.
>
>Bizarrely,
>
>diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c
>index 6000ad7..cf074400 100644
>--- a/arch/x86/mm/pageattr.c
>+++ b/arch/x86/mm/pageattr.c
>@@ -141,6 +141,7 @@ void clflush_cache_range(void *vaddr, unsigned int
>size)
>        for (; p < vend; p += clflush_size)
>                clflushopt(p);
> 
>+       clflushopt(vend-1);
>        mb();
> }
> EXPORT_SYMBOL_GPL(clflush_cache_range);
>
>works like a charm.
>-Chris

That clflushopt touches a cache line already touched and therefore serializes with it.

Chris Wilson Jan. 12, 2016, 9:13 p.m. UTC | #6

On Tue, Jan 12, 2016 at 09:05:19AM -0800, Linus Torvalds wrote:
> On Tue, Jan 12, 2016 at 8:37 AM, Chris Wilson <chris@chris-wilson.co.uk> wrote:
> > On Mon, Jan 11, 2016 at 09:05:06PM +0000, Chris Wilson wrote:
> >> I can narrow down the principal buggy path by doing the clflush(vend-1)
> >> in the callers at least.
> >
> > That leads to the suspect path being a read back of a cache line from
> > main memory that was just written to by the GPU.
> 
> How do you know it was written by the GPU?

Test construction: write some data, copy it on the GPU, read it back.
Repeat for various data, sequences of copy (differing GPU instructions,
intermediates etc), and accessors.

> Maybe it's a memory ordering issue on the GPU. Say it writes something
> to memory, then sets the "I'm done" flag (or whatever you check), but
> because of ordering on the GPU the "I'm done" flag is visible before.

That is a continual worry. To try and assuage that fear, I sent 8x
flush gpu writes between the end of the copy and setting the "I'm done"
flag. The definition of the GPU flush is that it both flushes all
previous writes before it completes and only after it completes does it
do the post-sync write (before moving onto the next command). The spec
is always a bit hazy on what order the memory writes will be visible on
the CPU though.

Sending the 8x GPU flushes before marking "I'm done" did not fix the
corruption.

> So the reason you see the old content may just be that the GPU writes
> are still buffered on the GPU. And you adding a clflushopt on the same
> address just changes the timing enough that you don't see the memory
> ordering any more (or it's just much harder to see, it might still be
> there).

Indeed. So I replaced the post-clflush_cache_range() clflush() with a
udelay(10) instead, and the corruption vanished. Putting the udelay(10)
before the clflush_cache_range() does not fix the corruption.

> Maybe the reason you only see the problem with the last cacheline is
> simply that the "last" cacheline is also the last that was written by
> the GPU, and it's still in the GPU write buffers.

Exactly the fear.

> Also, did you ever print out the value of clflush_size? Maybe we just
> got it wrong and it's bogus data.

It's 64 bytes as expected. And fudging it to any other value quickly
explodes :)

Since:

	/* lots of GPU flushes + GPU/CPU sync point */
	udelay(10);
	clflush_cache_range(vaddr, size);
	memcpy(user, vaddr, size);

fails, but

	/* lots of GPU flushes + GPU/CPU sync point */
	clflush_cache_range(vaddr, size);
	udelay(10);
	memcpy(user, vaddr, size);

passes, I'm inclined to point the finger at the mb() following the
clflush_cache_range().
-Chris

Linus Torvalds Jan. 12, 2016, 10:07 p.m. UTC | #7

On Tue, Jan 12, 2016 at 1:13 PM, Chris Wilson <chris@chris-wilson.co.uk> wrote:
>
> That is a continual worry. To try and assuage that fear, I sent 8x
> flush gpu writes between the end of the copy and setting the "I'm done"
> flag. The definition of the GPU flush is that it both flushes all
> previous writes before it completes and only after it completes does it
> do the post-sync write (before moving onto the next command). The spec
> is always a bit hazy on what order the memory writes will be visible on
> the CPU though.
>
> Sending the 8x GPU flushes before marking "I'm done" did not fix the
> corruption.

Ok. So assuming the GPU flushes are supposed to work, it should be all good.

>> So the reason you see the old content may just be that the GPU writes
>> are still buffered on the GPU. And you adding a clflushopt on the same
>> address just changes the timing enough that you don't see the memory
>> ordering any more (or it's just much harder to see, it might still be
>> there).
>
> Indeed. So I replaced the post-clflush_cache_range() clflush() with a
> udelay(10) instead, and the corruption vanished. Putting the udelay(10)
> before the clflush_cache_range() does not fix the corruption.

Odd.

> passes, I'm inclined to point the finger at the mb() following the
> clflush_cache_range().

We have an entirely unrelated discussion about the value of "mfence"
as a memory barrier.

Mind trying to just make the memory barrier (in
arch/x86/include/asm/barrier.h) be a locked op instead?

The docs say "Executions of the CLFLUSHOPT instruction are ordered
with respect to fence instructions and to locked read-modify-write
instructions; ..", so the mfence should be plenty good enough. But
nobody sane uses mfence for memory ordering (that's the other
discussion we're having), since a locked rmw instruction is faster.

So maybe it's a CPU bug. I'd still consider a GPU memory ordering bug
*way* more likely (the CPU core tends to be better validated in my
experience), but since you're trying odd things anyway, try changing
the "mfence" to "lock; addl $0,0(%%rsp)" instead.

I doubt it makes any difference, but ..

            Linus
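
For anyone wanting to try the two barrier flavours side by side outside the kernel, here is a minimal userspace sketch. It assumes GCC or Clang inline assembly on x86-64 and falls back to a C11 fence on other targets so that it still builds. The function names are hypothetical; the kernel's real definitions live in arch/x86/include/asm/barrier.h and are not reproduced here.

```c
#include <assert.h>

#if defined(__GNUC__) && defined(__x86_64__)
/* Full fence using the dedicated instruction. */
static inline void barrier_mfence(void)
{
    __asm__ __volatile__("mfence" ::: "memory");
}

/* Locked read-modify-write that adds zero to the word at the stack pointer. */
static inline void barrier_locked_add(void)
{
    __asm__ __volatile__("lock; addl $0,0(%%rsp)" ::: "memory", "cc");
}
#else
#include <stdatomic.h>
/* Fallback for other targets: both variants become a C11 full fence. */
static inline void barrier_mfence(void)
{
    atomic_thread_fence(memory_order_seq_cst);
}

static inline void barrier_locked_add(void)
{
    atomic_thread_fence(memory_order_seq_cst);
}
#endif

/* Store, run the chosen barrier, then load the same slot back. */
static int store_then_load(volatile int *slot, int value, void (*fence)(void))
{
    assert(slot != 0 && fence != 0);
    *slot = value;
    fence();
    return *slot;
}
```

This only shows that the two variants are interchangeable at the call site. It says nothing about how either one interacts with CLFLUSHOPT on a given CPU, which is the question being tested here.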

Chris Wilson Jan. 13, 2016, 12:55 a.m. UTC | #8

On Tue, Jan 12, 2016 at 02:07:35PM -0800, Linus Torvalds wrote:
> On Tue, Jan 12, 2016 at 1:13 PM, Chris Wilson <chris@chris-wilson.co.uk> wrote:
> > Indeed. So I replaced the post-clflush_cache_range() clflush() with a
> > udelay(10) instead, and the corruption vanished. Putting the udelay(10)
> > before the clflush_cache_range() does not fix the corruption.
> 
> Odd.
> 
> > passes, I'm inclined to point the finger at the mb() following the
> > clflush_cache_range().
> 
> We have an entirely unrelated discussion about the value of "mfence"
> as a memory barrier.
> 
> Mind trying to just make the memory barrier (in
> arch/x86/include/asm/barrier.h) be a locked op instead?

Replacing the following mb() in clflush_cache_range() with
"lock; addl $0,0(%%rsp)" gave no improvement. Neither did replacing all
mb(). Undoubtedly the memory stream as seen by the CPU is serialised.
The concern then is back to the visibility of the stream from the external
device. Adding an uncached mmio read after the clflush_cache_range()
also works, but I'm not clear if this achieves anything other than
inserting a delay or if it is flushing some other write buffer in the
chipset. It is odd that it is required after the clflush_cache_range() and
ineffective before. It is also odd that such a flush would be needed
multiple times after the GPU write.

The double clflush() remains a mystery.
-Chris

Linus Torvalds Jan. 13, 2016, 2:06 a.m. UTC | #9

On Tue, Jan 12, 2016 at 4:55 PM, Chris Wilson <chris@chris-wilson.co.uk> wrote:
>
> The double clflush() remains a mystery.

Actually, I think it's explainable.

It's wrong to do the clflush *after* the GPU has done the write, which
seems to be what you are doing.

Why?

If the GPU really isn't cache coherent, what can happen is:

 - the CPU has the line cached

 - the GPU writes the data

 - you do the clflushopt to invalidate the cacheline

 - you expect to see the GPU data.

Right?

Wrong. The above is complete crap.

Why?

Very simple reason: the CPU may have had the cacheline dirty at some
level in its caches, so when you did the clflushopt, it didn't just
invalidate the CPU cacheline, it wrote it back to memory. And in the
process over-wrote the data that the GPU had written.

Now you can say "but the CPU never wrote to the cacheline, so it's not
dirty in the CPU caches". That may or may not be true. The CPU may
have written to it quite a long time ago.

So if you are doing a GPU write, and you want to see the data that the
GPU wrote, you had better do the clflushopt long *before* the GPU ever
writes to memory.

Your pattern of doing "flush and read" is simply fundamentally buggy.
There are only two valid CPU flushing patterns:

 - write and flush (to make the writes visible to the GPU)

 - flush before starting GPU accesses, and then read

At no point can "flush and read" be right.

Now, I haven't actually seen your code, so I'm just going by your
high-level description of where the CPU flush and CPU read were done,
but it *sounds* like you did that invalid "flush and read" behavior.

                 Linus
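
The hazard described above can be illustrated with a toy model of a single cache line shared with a non-coherent device. This is only a sketch under simplifying assumptions: every name is hypothetical, nothing touches a real cache, and the flush is modelled as "write back if dirty, then invalidate".

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of one cache line; not a description of real hardware. */
struct line_model {
    int memory;    /* contents of main memory */
    int cached;    /* contents of the CPU's copy, if present */
    bool present;  /* the CPU holds a copy */
    bool dirty;    /* the CPU copy is newer than memory */
};

static void cpu_write(struct line_model *m, int v)
{
    m->cached = v;
    m->present = true;
    m->dirty = true;
}

static int cpu_read(struct line_model *m)
{
    if (!m->present) {              /* miss: fill from memory */
        m->cached = m->memory;
        m->present = true;
        m->dirty = false;
    }
    return m->cached;
}

/* Modelled flush: write back if dirty, then invalidate. */
static void cpu_flush(struct line_model *m)
{
    if (m->present && m->dirty)
        m->memory = m->cached;
    m->present = false;
    m->dirty = false;
}

/* The device writes straight to memory and never sees the CPU cache. */
static void device_write(struct line_model *m, int v)
{
    m->memory = v;
}

/* "Flush and read": the flush happens after the device write. */
static int flush_after_device_write(int cpu_value, int device_value)
{
    struct line_model m = { 0, 0, false, false };
    cpu_write(&m, cpu_value);       /* old CPU store that was never flushed */
    device_write(&m, device_value);
    cpu_flush(&m);                  /* dirty line overwrites the device data */
    return cpu_read(&m);
}

/* "Flush before starting device accesses, and then read". */
static int flush_before_device_write(int cpu_value, int device_value)
{
    struct line_model m = { 0, 0, false, false };
    cpu_write(&m, cpu_value);
    cpu_flush(&m);                  /* line leaves the cache before the device runs */
    device_write(&m, device_value);
    return cpu_read(&m);
}
```

In this model flush_after_device_write(1, 2) returns 1, the stale CPU value, while flush_before_device_write(1, 2) returns 2, the device value.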

Andy Lutomirski Jan. 13, 2016, 2:42 a.m. UTC | #10

On Tue, Jan 12, 2016 at 6:06 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Tue, Jan 12, 2016 at 4:55 PM, Chris Wilson <chris@chris-wilson.co.uk> wrote:
>>
>> The double clflush() remains a mystery.
>
> Actually, I think it's explainable.
>
> It's wrong to do the clflush *after* the GPU has done the write, which
> seems to be what you are doing.
>
> Why?
>
> If the GPU really isn't cache coherent, what can happen is:
>
>  - the CPU has the line cached
>
>  - the GPU writes the data
>
>  - you do the clflushopt to invalidate the cacheline
>
>  - you expect to see the GPU data.
>
> Right?
>
> Wrong. The above is complete crap.
>
> Why?
>
> Very simple reason: the CPU may have had the cacheline dirty at some
> level in its caches, so when you did the clflushopt, it didn't just
> invalidate the CPU cacheline, it wrote it back to memory. And in the
> process over-wrote the data that the GPU had written.
>
> Now you can say "but the CPU never wrote to the cacheline, so it's not
> dirty in the CPU caches". That may or may not be true. The CPU may
> have written to it quite a long time ago.
>
> So if you are doing a GPU write, and you want to see the data that the
> GPU wrote, you had better do the clflushopt long *before* the GPU ever
> writes to memory.
>
> Your pattern of doing "flush and read" is simply fundamentally buggy.
> There are only two valid CPU flushing patterns:
>
>  - write and flush (to make the writes visible to the GPU)
>
>  - flush before starting GPU accesses, and then read
>
> At no point can "flush and read" be right.
>
> Now, I haven't actually seen your code, so I'm just going by your
> high-level description of where the CPU flush and CPU read were done,
> but it *sounds* like you did that invalid "flush and read" behavior.

Since barriers are on my mind: how strong a barrier is needed to
prevent cache fills from being speculated across the barrier?

Concretely, if I do:

clflush A
clflush B
load A
load B

the architecture guarantees that (unless store forwarding happens) the
value I see for B is at least as new as the value I see for A *with
respect to other access within the coherency domain*.  But the GPU
isn't in the coherency domain at all.

Is:

clflush A
clflush B
load A
MFENCE
load B

good enough?  If it is, and if

clflush A
clflush B
load A
LOCK whatever
load B

is *not*, then this might account for the performance difference.

In any event, it seems to me that what i915 is trying to do isn't
really intended to be supported for WB memory.  i915 really wants to
force a read from main memory and to simultaneously prevent the CPU
from writing back to main memory.  Ick.  I'd assume that:

clflush A
clflush B
load A
serializing instruction here
load B

is good enough, as long as you make sure that the GPU does its writes
after the clflushes make it all the way out to main memory (which
might require a second serializing instruction in the case of
clflushopt), but this still relies on the hardware prefetcher not
prefetching B too early, which it's permitted to do even in the
absence of any explicit access at all.

Presumably this is good enough on any implementation:

clflush A
clflush B
load A
clflush B
load B

But that will be really, really slow.  And you're still screwed if the
hardware is permitted to arbitrarily change cache lines from S to M.

In other words, I'm not really convinced that x86 was ever intended to
have well-defined behavior if something outside the coherency domain
writes to a page of memory while that page is mapped WB.  Of course,
I'm also not sure how to reliably switch a page from WB to any other
memory type short of remapping it and doing CLFLUSH after remapping.

SDM Volume 3 11.12.4 seems to agree with me.

Could the driver be changed to use WC or UC and to use MOVNTDQA on
supported CPUs to get the performance back?  It sounds like i915 is
effectively doing PIO here, and reasonably modern CPUs have a nice set
of fast PIO instructions.

--Andy

Linus Torvalds Jan. 13, 2016, 4:39 a.m. UTC | #11

On Tue, Jan 12, 2016 at 6:42 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>
> Since barriers are on my mind: how strong a barrier is needed to
> prevent cache fills from being speculated across the barrier?

I don't think there are *any* architectural guarantees.

I suspect that a real serializing instruction should do it. But I
don't think even that is guaranteed.

Non-coherent IO is crazy. I really thought Intel had learnt their
lesson, and finally made all the GPU's coherent. I'm afraid to even
ask why Chris is actually working on some sh*t that requires clflush.

In general, you should probably do something nasty like

 - flush before starting IO that generates data (to make sure you have
no dirty cachelines that will write back and mess up)

 - start the IO, wait for it to complete

 - flush after finishing IO that generates the data (to make sure you
have no speculative clean cachelines with stale data)

 - read the data now.

Of course, what people actually end up doing to avoid all this is to
mark the memory noncacheable.

And finally, the *correct* thing is to not have crap hardware, and
have IO be cache coherent. Things that don't do that are shit. Really.

                 Linus
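
The four steps above can be walked through with a toy single-line model. It is a sketch under stated assumptions only: the names are hypothetical, a speculative fill is modelled as an ordinary fill that happens before the device write lands, and no real cache is involved.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of one cache line; not a description of real hardware. */
struct line_model {
    int memory;    /* contents of main memory */
    int cached;    /* contents of the CPU's copy, if present */
    bool present;  /* the CPU holds a copy */
    bool dirty;    /* the CPU copy is newer than memory */
};

/* Demand or speculative fill: pull the line in if it is not cached. */
static void cpu_fill(struct line_model *m)
{
    if (!m->present) {
        m->cached = m->memory;
        m->present = true;
        m->dirty = false;
    }
}

static void cpu_write(struct line_model *m, int v)
{
    m->cached = v;
    m->present = true;
    m->dirty = true;
}

/* Modelled flush: write back if dirty, then invalidate. */
static void cpu_flush(struct line_model *m)
{
    if (m->present && m->dirty)
        m->memory = m->cached;
    m->present = false;
    m->dirty = false;
}

static int cpu_read(struct line_model *m)
{
    cpu_fill(m);
    return m->cached;
}

/* Runs the sequence; flush_after selects whether step 3 is performed. */
static int read_device_result(int stale, int fresh, bool flush_after)
{
    struct line_model m = { 0, 0, false, false };

    cpu_write(&m, stale);   /* the CPU wrote the buffer at some earlier point */
    cpu_flush(&m);          /* step 1: no dirty line is left to write back later */
    cpu_fill(&m);           /* assumed speculative fill of the old contents */
    m.memory = fresh;       /* step 2: device writes memory, cache is untouched */
    if (flush_after)
        cpu_flush(&m);      /* step 3: drop the clean but stale copy */
    return cpu_read(&m);    /* step 4: read the data */
}
```

With the second flush the read returns the device's value; without it the read returns the stale value that the speculative fill left in the cache.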

Chris Wilson Jan. 13, 2016, 12:34 p.m. UTC | #12

On Tue, Jan 12, 2016 at 06:06:34PM -0800, Linus Torvalds wrote:
> On Tue, Jan 12, 2016 at 4:55 PM, Chris Wilson <chris@chris-wilson.co.uk> wrote:
> >
> > The double clflush() remains a mystery.
> 
> Actually, I think it's explainable.
> 
> It's wrong to do the clflush *after* the GPU has done the write, which
> seems to be what you are doing.
> 
> Why?
> 
> If the GPU really isn't cache coherent, what can happen is:
> 
>  - the CPU has the line cached
> 
>  - the GPU writes the data
> 
>  - you do the clflushopt to invalidate the cacheline
> 
>  - you expect to see the GPU data.
> 
> Right?
> 
> Wrong. The above is complete crap.
> 
> Why?
> 
> Very simple reason: the CPU may have had the cacheline dirty at some
> level in its caches, so when you did the clflushopt, it didn't just
> invalidate the CPU cacheline, it wrote it back to memory. And in the
> process over-wrote the data that the GPU had written.

Forgive me for being dense, but if we overwrite the GPU data in the
backing struct page with the cacheline from the CPU, how do we see the
results from the GPU afterwards?

> Now you can say "but the CPU never wrote to the cacheline, so it's not
> dirty in the CPU caches". That may or may not be true. The CPU may
> have written to it quite a long time ago.

What we do is we clflush the entire object after it is written to by the
CPU (including newly allocated objects from shmemfs, or pages being
returned to us by shmemfs) before any operation on that object by the
GPU. We have to, so that the GPU sees the correct page contents.
(If I change the clflush of the written objects to a wbinvd, that's not
sufficient for the tests to pass.)

We do not clflush the object after we read the backing pages on the CPU
before the next GPU operation, even if it is a GPU write. This leaves us
with clean but stale cachelines. Should. That's why we then
clflush_cache_range() prior to the next read on the object, it is
intended to be a pure cache line invalidation.

If we clflush the entire object between every CPU read back and the *next*
GPU operation, it fails. If we clflush the object before every GPU write
to it, it passes. And to refresh, we always clflush after a CPU write.

I am reasonably confident that any cachelines we dirty (or inherited)
are flushed. What you are suggesting is that there are dirty cachelines
regardless. I am also reasonably confident that even if we clflush the
entire object after touching it before the GPU write, and clflush the
individual cachelines again after the GPU write, we see the errors.

I haven't found the hole yet, or been convincingly able to explain the
differences between gen.
-Chris

Linus Torvalds Jan. 13, 2016, 6:45 p.m. UTC | #13

On Wed, Jan 13, 2016 at 4:34 AM, Chris Wilson <chris@chris-wilson.co.uk> wrote:
>
> Forgive me for being dense, but if we overwrite the GPU data in the
> backing struct page with the cacheline from the CPU, how do we see the
> results from the GPU afterwards?

Hmm. Good point.

Ok, all the symptoms just say "writes from GPU are delayed and out of order".

Do you have access to the GPU hardware people?

I thought that all the modern Intel GPU's are cache-coherent. If this
is some castrated chip where coherence is removed (perhaps because it
is not working? perhaps config setting?) maybe it needs some extra
hardware setting to make the GPU "flush" operation actually do
something. In a cache-coherent model, a flush could/should be a noop,
so maybe the hardware is set for that kind of "flush does nothing"
behavior.

Or maybe the GPU is just a buggy pile of crap.

            Linus
