| Message ID | 20160107215401.GB25144@nuc-i3427.alporthouse.com (mailing list archive) |
|---|---|
| State | New, archived |
On 01/07/16 13:54, Chris Wilson wrote:
>
> Whilst you are looking at this asm, note that we reload
> boot_cpu_data.x86_clflush_size every time around the loop. That's a
> small but noticeable extra cost (especially when we are only flushing
> a single cacheline).
>

I did notice that; I don't know if this is a compiler failure to do an
obvious hoisting (which should then be merged with the other load) or a
consequence of the volatile. It might be useful to report this to the
gcc bugzilla to let them look at it.

Either way, the diff looks good and you can add my:

Acked-by: H. Peter Anvin <hpa@linux.intel.com>

However, I see absolutely nothing wrong with the assembly other than
minor optimization possibilities: since gcc generates an early-out test
as well as a late-out test anyway, we could add an explicit:

	if (p >= vend)
		return;

before the first mb() at no additional cost (assuming gcc is smart
enough to skip the second test; otherwise that can easily be done
manually by replacing the for loop with a do { } while loop).

I would be very interested in knowing if replacing the final clflushopt
with a clflush would resolve your problems (in which case the last mb()
shouldn't be necessary either.)

Something like:

void clflush_cache_range(void *vaddr, unsigned int size)
{
	unsigned long clflush_size = boot_cpu_data.x86_clflush_size;
	char *vend = (char *)vaddr + size;
	char *p = (char *)((unsigned long)vaddr & ~(clflush_size - 1));

	if (p >= vend)
		return;

	mb();

	for (; p < vend - clflush_size; p += clflush_size)
		clflushopt(p);

	clflush(p);	/* Serializing and thus a barrier */
}
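For reference, the do { } while shape hpa alludes to would look roughly
like the sketch below (this is not code from the thread, just the
early-out test combined with a do/while over the original full-range
loop, assuming the kernel's clflushopt() and mb() helpers):

```c
/*
 * A sketch of the early-out + do/while variant: the explicit test
 * returns before the first mb() when there is nothing to flush, and
 * the do/while avoids a second loop-entry bound check.
 */
void clflush_cache_range(void *vaddr, unsigned int size)
{
	unsigned long clflush_size = boot_cpu_data.x86_clflush_size;
	char *vend = (char *)vaddr + size;
	char *p = (char *)((unsigned long)vaddr & ~(clflush_size - 1));

	if (p >= vend)
		return;

	mb();
	do {
		clflushopt(p);
		p += clflush_size;
	} while (p < vend);
	mb();
}
```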
On 01/07/16 14:29, H. Peter Anvin wrote:
>
> I would be very interested in knowing if replacing the final clflushopt
> with a clflush would resolve your problems (in which case the last mb()
> shouldn't be necessary either.)
>

Nevermind. CLFLUSH is not ordered with regards to CLFLUSHOPT to the
same cache line.

Could you add a sync_cpu(); call to the end (can replace the final mb())
and see if that helps your case?

	-hpa
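A sketch of the experiment being requested here (the helper is actually
spelled sync_core(), as corrected downthread; it comes from
asm/processor.h):

```c
/*
 * Sketch: same flush loop, but the trailing mb() is replaced by a
 * fully CPU-serializing call.
 */
void clflush_cache_range(void *vaddr, unsigned int size)
{
	unsigned long clflush_size = boot_cpu_data.x86_clflush_size;
	char *vend = (char *)vaddr + size;
	char *p = (char *)((unsigned long)vaddr & ~(clflush_size - 1));

	mb();
	for (; p < vend; p += clflush_size)
		clflushopt(p);
	sync_core();	/* serializing; replaces the final mb() */
}
```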
On 01/07/16 14:32, H. Peter Anvin wrote:
>
> Nevermind. CLFLUSH is not ordered with regards to CLFLUSHOPT to the
> same cache line.
>

*Except* to the same cache line, dammit.

	-hpa
On Thu, Jan 07, 2016 at 02:32:23PM -0800, H. Peter Anvin wrote:
> On 01/07/16 14:29, H. Peter Anvin wrote:
> >
> > I would be very interested in knowing if replacing the final clflushopt
> > with a clflush would resolve your problems (in which case the last mb()
> > shouldn't be necessary either.)
> >
>
> Nevermind. CLFLUSH is not ordered with regards to CLFLUSHOPT to the
> same cache line.
>
> Could you add a sync_cpu(); call to the end (can replace the final mb())
> and see if that helps your case?

s/sync_cpu()/sync_core()/

No. I still see failures on Baytrail and Braswell (Pineview is not
affected) with the final mb() replaced with sync_core(). I can reproduce
failures on Pineview by tweaking the clflush_cache_range() parameters,
so I am fairly confident that it is validating the current code.

iirc sync_core() is cpuid, a heavy serialising instruction, an
alternative to mfence. Is there anything else that I can infer about
the nature of my bug from this result?
-Chris
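For context on the cpuid remark: the x86-64 sync_core() of this era is
essentially a serializing CPUID, roughly as below (simplified from
asm/processor.h; the 32-bit/486 special case is omitted):

```c
static inline void sync_core(void)
{
	int tmp;

	/*
	 * CPUID(eax=1) is an architecturally serializing instruction:
	 * all prior instructions retire and buffered stores drain
	 * before the next instruction is fetched, which is strictly
	 * stronger ordering than MFENCE provides.
	 */
	asm volatile("cpuid"
		     : "=a" (tmp)
		     : "0" (1)
		     : "ebx", "ecx", "edx", "memory");
}
```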
On Sat, Jan 9, 2016 at 12:01 AM, Chris Wilson <chris@chris-wilson.co.uk> wrote:
> On Thu, Jan 07, 2016 at 02:32:23PM -0800, H. Peter Anvin wrote:
>> On 01/07/16 14:29, H. Peter Anvin wrote:
>> >
>> > I would be very interested in knowing if replacing the final clflushopt
>> > with a clflush would resolve your problems (in which case the last mb()
>> > shouldn't be necessary either.)
>> >
>>
>> Nevermind. CLFLUSH is not ordered with regards to CLFLUSHOPT to the
>> same cache line.
>>
>> Could you add a sync_cpu(); call to the end (can replace the final mb())
>> and see if that helps your case?
>
> s/sync_cpu()/sync_core()/
>
> No. I still see failures on Baytrail and Braswell (Pineview is not
> affected) with the final mb() replaced with sync_core(). I can reproduce
> failures on Pineview by tweaking the clflush_cache_range() parameters,
> so I am fairly confident that it is validating the current code.
>
> iirc sync_core() is cpuid, a heavy serialising instruction, an
> alternative to mfence. Is there anything else that I can infer about
> the nature of my bug from this result?

No clue, but I don't know much about the underlying architecture.

Can you try clflush_cache_range()'ing one cacheline less and then
manually doing clflushopt; mb on the last cache line, just to make sure
that the helper is really doing the right thing?

You could also try clflush instead of clflushopt to see if that makes a
difference.

--Andy
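A hypothetical harness for the test Andy proposes might look like the
following (a sketch only; the split() name and the alignment arithmetic
for finding the last line are mine, and it assumes the range covers at
least one cacheline):

```c
/*
 * Flush all but the last cacheline through the helper under test,
 * then handle the last line by hand, per Andy's suggestion.
 */
static void clflush_cache_range_split(void *vaddr, unsigned int size)
{
	unsigned long clflush_size = boot_cpu_data.x86_clflush_size;
	char *vend = (char *)vaddr + size;
	char *last = (char *)(((unsigned long)vend - 1) & ~(clflush_size - 1));

	/* Everything up to, but excluding, the last cache line. */
	if (last > (char *)vaddr)
		clflush_cache_range(vaddr, last - (char *)vaddr);

	/* The last cache line, flushed manually: clflushopt; mb. */
	clflushopt(last);
	mb();
}
```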
diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c
index a3137a4..2cd2b4b 100644
--- a/arch/x86/mm/pageattr.c
+++ b/arch/x86/mm/pageattr.c
@@ -129,14 +129,13 @@ within(unsigned long addr, unsigned long start, unsigned long end)
  */
 void clflush_cache_range(void *vaddr, unsigned int size)
 {
-	unsigned long clflush_mask = boot_cpu_data.x86_clflush_size - 1;
+	unsigned long clflush_size = boot_cpu_data.x86_clflush_size;
 	void *vend = vaddr + size;
-	void *p;
+	void *p = (void *)((unsigned long)vaddr & ~(clflush_size - 1));
 
 	mb();
 
-	for (p = (void *)((unsigned long)vaddr & ~clflush_mask);
-	     p < vend; p += boot_cpu_data.x86_clflush_size)
+	for (; p < vend; p += clflush_size)
 		clflushopt(p);
 
 	mb();
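With the diff applied, the whole function reads as below; the cacheline
size is loaded once into a local, so the loop no longer reloads
boot_cpu_data.x86_clflush_size on every iteration (arithmetic on the
void * iterator relies on the GCC extension the kernel builds with):

```c
void clflush_cache_range(void *vaddr, unsigned int size)
{
	unsigned long clflush_size = boot_cpu_data.x86_clflush_size;
	void *vend = vaddr + size;
	void *p = (void *)((unsigned long)vaddr & ~(clflush_size - 1));

	mb();

	/* Flush each cacheline in [p, vend) with the weakly-ordered CLFLUSHOPT. */
	for (; p < vend; p += clflush_size)
		clflushopt(p);

	mb();
}
```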