
x86/mm/tlb: ignore f->new_tlb_gen when zero

Message ID 20220708003053.158480-1-namit@vmware.com (mailing list archive)
State New
Series x86/mm/tlb: ignore f->new_tlb_gen when zero

Commit Message

Nadav Amit July 8, 2022, 12:30 a.m. UTC
From: Nadav Amit <namit@vmware.com>

Commit aa44284960d5 ("x86/mm/tlb: Avoid reading mm_tlb_gen when
possible") introduced an optimization of skipping the flush if the TLB
generation that is flushed (as provided in flush_tlb_info) was already
flushed.

However, arch_tlbbatch_flush() does not provide any generation in
flush_tlb_info. As a result, try_to_unmap_one() would not perform any
TLB flushes.

Fix it by checking whether f->new_tlb_gen is nonzero. A zero value is
anyhow an invalid generation value.

In addition, add the missing unlikely() and jump to get tracing right.

Fixes: aa44284960d5 ("x86/mm/tlb: Avoid reading mm_tlb_gen when possible")
Reported-by: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Nadav Amit <namit@vmware.com>
---
 arch/x86/mm/tlb.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)
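
To make the failure mode concrete, here is a minimal user-space model of the
generation check (illustrative only; the function names are invented for this
sketch and nothing below is kernel code). With the batched-flush path leaving
new_tlb_gen at zero, the old check "new_tlb_gen <= local_tlb_gen" always
holds and the flush is skipped; treating zero as "no generation supplied"
restores the flush.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Model of the check in flush_tlb_func(): should this CPU flush? */
static bool should_flush_old(uint64_t new_tlb_gen, uint64_t local_tlb_gen)
{
	/* Pre-fix logic: skip whenever new_tlb_gen <= local_tlb_gen. */
	return !(new_tlb_gen <= local_tlb_gen);
}

static bool should_flush_fixed(uint64_t new_tlb_gen, uint64_t local_tlb_gen)
{
	/*
	 * Post-fix logic: a zero generation means "no generation supplied",
	 * so the flush must not be skipped on that basis.
	 */
	return !(new_tlb_gen != 0 && new_tlb_gen <= local_tlb_gen);
}

int main(void)
{
	/* arch_tlbbatch_flush() case: no generation supplied (0). */
	printf("batched flush, old check:   flush=%d\n",
	       should_flush_old(0, 5));		/* 0: wrongly skipped */
	printf("batched flush, fixed check: flush=%d\n",
	       should_flush_fixed(0, 5));	/* 1: flush performed */

	/* Ordinary case: generation already covered by an earlier flush. */
	printf("stale request, fixed check: flush=%d\n",
	       should_flush_fixed(4, 5));	/* 0: legitimately skipped */
	return 0;
}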

Comments

David Hildenbrand July 8, 2022, 11:40 a.m. UTC | #1
On 08.07.22 02:30, Nadav Amit wrote:
> From: Nadav Amit <namit@vmware.com>
> 
> Commit aa44284960d5 ("x86/mm/tlb: Avoid reading mm_tlb_gen when
> possible") introduced an optimization of skipping the flush if the TLB
> generation that is flushed (as provided in flush_tlb_info) was already
> flushed.
> 
> However, arch_tlbbatch_flush() does not provide any generation in
> flush_tlb_info. As a result, try_to_unmap_one() would not perform any
> TLB flushes.
> 
> Fix it by checking whether f->new_tlb_gen is nonzero. A zero value is
> anyhow an invalid generation value.
> 
> In addition, add the missing unlikely() and jump to get tracing right.
> 
> Fixes: aa44284960d5 ("x86/mm/tlb: Avoid reading mm_tlb_gen when possible")
> Reported-by: Hugh Dickins <hughd@google.com>
> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
> Cc: Andy Lutomirski <luto@kernel.org>
> Signed-off-by: Nadav Amit <namit@vmware.com>
> ---
>  arch/x86/mm/tlb.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
> index d9314cc8b81f..d81b4084bb8a 100644
> --- a/arch/x86/mm/tlb.c
> +++ b/arch/x86/mm/tlb.c
> @@ -771,14 +771,14 @@ static void flush_tlb_func(void *info)
>  		return;
>  	}
>  
> -	if (f->new_tlb_gen <= local_tlb_gen) {
> +	if (unlikely(f->new_tlb_gen != 0 && f->new_tlb_gen <= local_tlb_gen)) {
>  		/*
>  		 * The TLB is already up to date in respect to f->new_tlb_gen.
>  		 * While the core might be still behind mm_tlb_gen, checking
>  		 * mm_tlb_gen unnecessarily would have negative caching effects
>  		 * so avoid it.
>  		 */
> -		return;
> +		goto done;

Does this affect the performance numbers from aa44284960d5 ("x86/mm/tlb:
Avoid reading mm_tlb_gen when possible")?
Dave Hansen July 8, 2022, 2:49 p.m. UTC | #2
On 7/7/22 17:30, Nadav Amit wrote:

You might want to fix the clock on the system from which you sent this.
 I was really scratching my head trying to figure out how you got this
patch out before Hugh's bug report.

> From: Nadav Amit <namit@vmware.com>
> 
> Commit aa44284960d5 ("x86/mm/tlb: Avoid reading mm_tlb_gen when
> possible") introduced an optimization of skipping the flush if the TLB
> generation that is flushed (as provided in flush_tlb_info) was already
> flushed.
> 
> However, arch_tlbbatch_flush() does not provide any generation in
> flush_tlb_info. As a result, try_to_unmap_one() would not perform any
> TLB flushes.
> 
> Fix it by checking whether f->new_tlb_gen is nonzero. A zero value is
> anyhow an invalid generation value.

It is, but the check below uses 'f->end == TLB_FLUSH_ALL' as the marker
for f->new_tlb_gen being invalid.  Being consistent seems like a good
idea on this stuff.
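
For reference, the read site being referred to looks roughly like the
following (paraphrased from flush_tlb_func() and trimmed; details of the
real code may differ):

	/*
	 * The partial-flush path only trusts f->new_tlb_gen after checking
	 * that f->end says this is not a full flush.
	 */
	if (f->end != TLB_FLUSH_ALL &&
	    f->new_tlb_gen == local_tlb_gen + 1 &&
	    f->new_tlb_gen == mm_tlb_gen) {
		/* Partial flush: walk f->start..f->end one stride at a time. */
		...
	} else {
		/* Full flush. */
		...
	}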

> In addition, add the missing unlikely() and jump to get tracing right.

There are currently five routes out of flush_tlb_func():
 * Three early returns
 * One 'goto done'
 * One implicit return

The tracing code doesn't get run for any of the early returns, but
that's intentional because they don't *actually* flush the TLB.  We
don't want to record that flush_tlb_func() flushed the TLB when it
didn't.  There's another tracepoint on the TLB_REMOTE_SEND_IPI side to
tell where the flushes were requested.

That said, I think the

	if (unlikely(local_tlb_gen == mm_tlb_gen))
		goto done;

is arguably buggy, as is the 'goto done' in this hunk:

>  arch/x86/mm/tlb.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
> index d9314cc8b81f..d81b4084bb8a 100644
> --- a/arch/x86/mm/tlb.c
> +++ b/arch/x86/mm/tlb.c
> @@ -771,14 +771,14 @@ static void flush_tlb_func(void *info)
>  		return;
>  	}
>  
> -	if (f->new_tlb_gen <= local_tlb_gen) {
> +	if (unlikely(f->new_tlb_gen != 0 && f->new_tlb_gen <= local_tlb_gen)) {
>  		/*
>  		 * The TLB is already up to date in respect to f->new_tlb_gen.
>  		 * While the core might be still behind mm_tlb_gen, checking
>  		 * mm_tlb_gen unnecessarily would have negative caching effects
>  		 * so avoid it.
>  		 */
> -		return;
> +		goto done;
>  	}
>  
>  	/*

We might want to (eventually) think about doing something like the
attached patch to make the skipped flushes explicit in the tracing and
make the return paths out of this function more consistent.
Dave Hansen July 8, 2022, 3:13 p.m. UTC | #3
On 7/8/22 04:40, David Hildenbrand wrote:
>> diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
>> index d9314cc8b81f..d81b4084bb8a 100644
>> --- a/arch/x86/mm/tlb.c
>> +++ b/arch/x86/mm/tlb.c
>> @@ -771,14 +771,14 @@ static void flush_tlb_func(void *info)
>>  		return;
>>  	}
>>  
>> -	if (f->new_tlb_gen <= local_tlb_gen) {
>> +	if (unlikely(f->new_tlb_gen != 0 && f->new_tlb_gen <= local_tlb_gen)) {
>>  		/*
>>  		 * The TLB is already up to date in respect to f->new_tlb_gen.
>>  		 * While the core might be still behind mm_tlb_gen, checking
>>  		 * mm_tlb_gen unnecessarily would have negative caching effects
>>  		 * so avoid it.
>>  		 */
>> -		return;
>> +		goto done;
> Does this affect the performance numbers from aa44284960d5 ("x86/mm/tlb:
> Avoid reading mm_tlb_gen when possible")?

It depends on how many batched flushes that workload had.  From the
looks of it, they're all one page:

	madvise(addr + i, pgsize, MADV_DONTNEED);

so there shouldn't be *much* batching in play.  But, it wouldn't hurt to
re-run them in either case.
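
For reference, the inner loop of that microbenchmark (will-it-scale
tlb_flush1) is roughly approximated by the stand-alone sketch below. It is
illustrative only: the real testcase lives in the will-it-scale repository
and loops forever while bumping an iteration counter, and MEMSIZE/ITERATIONS
here are made up.

#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define MEMSIZE    (64UL << 20)		/* illustrative size */
#define ITERATIONS 100			/* illustrative bound */

int main(void)
{
	long pgsize = sysconf(_SC_PAGESIZE);
	char *addr = mmap(NULL, MEMSIZE, PROT_READ | PROT_WRITE,
			  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (addr == MAP_FAILED)
		return 1;

	for (int it = 0; it < ITERATIONS; it++) {
		/* Populate the mapping so there is something to zap. */
		for (unsigned long i = 0; i < MEMSIZE; i += pgsize)
			addr[i] = 1;
		/*
		 * Zap it again one page per call: each madvise() goes
		 * through the mmu_gather path, not the reclaim-side unmap
		 * batching that the reported bug affected.
		 */
		for (unsigned long i = 0; i < MEMSIZE; i += pgsize)
			madvise(addr + i, pgsize, MADV_DONTNEED);
	}
	printf("done\n");
	return 0;
}
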
Nadav Amit July 8, 2022, 4:54 p.m. UTC | #4
On Jul 8, 2022, at 8:13 AM, Dave Hansen <dave.hansen@intel.com> wrote:

> On 7/8/22 04:40, David Hildenbrand wrote:
>>> diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
>>> index d9314cc8b81f..d81b4084bb8a 100644
>>> --- a/arch/x86/mm/tlb.c
>>> +++ b/arch/x86/mm/tlb.c
>>> @@ -771,14 +771,14 @@ static void flush_tlb_func(void *info)
>>>             return;
>>>     }
>>> 
>>> -    if (f->new_tlb_gen <= local_tlb_gen) {
>>> +    if (unlikely(f->new_tlb_gen != 0 && f->new_tlb_gen <= local_tlb_gen)) {
>>>             /*
>>>              * The TLB is already up to date in respect to f->new_tlb_gen.
>>>              * While the core might be still behind mm_tlb_gen, checking
>>>              * mm_tlb_gen unnecessarily would have negative caching effects
>>>              * so avoid it.
>>>              */
>>> -            return;
>>> +            goto done;
>> Does this affect the performance numbers from aa44284960d5 ("x86/mm/tlb:
>> Avoid reading mm_tlb_gen when possible")?
> 
> It depends on how many batched flushes that workload had.  From the
> looks of it, they're all one page:
> 
>        madvise(addr + i, pgsize, MADV_DONTNEED);
> 
> so there shouldn't be *much* batching in play.  But, it wouldn't hurt to
> re-run them in either case.

Just to clarify, since these things are confusing.

There are two batching mechanisms. The common one is mmu_gather, which
MADV_DONTNEED uses. This one is *not* the one that caused the breakage.

The second one is the “unmap_batch”, which until now was only used by x86.
(I just saw patches for ARM, but I think they just exploit the interface in
a way.) The “unmap_batch” is used when you swap out. This was broken.
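
To see why that path ends up with no generation, here is a sketch of the
reclaim-side flush, paraphrased from arch/x86/mm/tlb.c around v5.19-rc and
heavily simplified (the body is elided; treat it as an outline, not the
literal code):

/*
 * try_to_unmap_one() only records which CPUs need flushing via
 * arch_tlbbatch_add_mm(); no tlb_gen is recorded along the way, so when
 * the batch is finally flushed the flush_tlb_info carries new_tlb_gen == 0.
 */
void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
{
	struct flush_tlb_info *info;
	int cpu = get_cpu();

	/* The last argument is new_tlb_gen: there is none to pass, hence 0. */
	info = get_flush_tlb_info(NULL, 0, TLB_FLUSH_ALL, 0, false, 0);

	/* ... flush locally and/or IPI every CPU in batch->cpumask ... */

	put_flush_tlb_info();
	put_cpu();
}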

Since the bug was not during MADV_DONTNEED there is no reason for the
results to be any different.

Famous last words?
Dave Hansen July 8, 2022, 5:01 p.m. UTC | #5
On 7/8/22 09:54, Nadav Amit wrote:
> Since the bug was not during MADV_DONTNEED there is no reason for the
> results to be any different.
> 
> Famous last words?

Considering that your patch broke the kernel in a way that surprised us
all, I think caution is warranted.  Re-running a microbenchmark that
takes five minutes and stresses things a bit is the least you can do, I
think.
Nadav Amit July 8, 2022, 5:04 p.m. UTC | #6
On Jul 8, 2022, at 7:49 AM, Dave Hansen <dave.hansen@intel.com> wrote:

> On 7/7/22 17:30, Nadav Amit wrote:
> 
> You might want to fix the clock on the system from which you sent this.
> I was really scratching my head trying to figure out how you got this
> patch out before Hugh's bug report.
> 
>> From: Nadav Amit <namit@vmware.com>
>> 
>> Commit aa44284960d5 ("x86/mm/tlb: Avoid reading mm_tlb_gen when
>> possible") introduced an optimization of skipping the flush if the TLB
>> generation that is flushed (as provided in flush_tlb_info) was already
>> flushed.
>> 
>> However, arch_tlbbatch_flush() does not provide any generation in
>> flush_tlb_info. As a result, try_to_unmap_one() would not perform any
>> TLB flushes.
>> 
>> Fix it by checking whether f->new_tlb_gen is nonzero. A zero value is
>> anyhow an invalid generation value.
> 
> It is, but the check below uses 'f->end == TLB_FLUSH_ALL' as the marker
> for f->new_tlb_gen being invalid.  Being consistent seems like a good
> idea on this stuff.

If we get a request to do a flush that we have logically already done,
regardless of whether it is full or partial, there is no reason to do it.

I therefore do not see a reason to look at f->end. I think that looking
at the generation is very intuitive. If you want, I can add a constant
such as TLB_GENERATION_INVALID.

> 
>> In addition, add the missing unlikely() and jump to get tracing right.
> 
> There are currently five routes out of flush_tlb_func():
> * Three early returns
> * One 'goto done'
> * One implicit return
> 
> The tracing code doesn't get run for any of the early returns, but
> that's intentional because they don't *actually* flush the TLB.  We
> don't want to record that flush_tlb_func() flushed the TLB when it
> didn't.  There's another tracepoint on the TLB_REMOTE_SEND_IPI side to
> tell where the flushes were requested.
> 
> That said, I think the
> 
>        if (unlikely(local_tlb_gen == mm_tlb_gen))
>                goto done;
> 
> is arguably buggy, as is the 'goto done' in this hunk:

I was just trying to follow it for consistency. Will remove.

> 
> We might want to (eventually) think about doing something like the
> attached patch to make the skipped flushes explicit in the tracing and
> make the return paths out of this function more consistent.

That’s fine with me. I just recommend that you have a single tracing call in
the function, since having too many ruins the generated code.
Nadav Amit July 8, 2022, 5:09 p.m. UTC | #7
On Jul 8, 2022, at 10:01 AM, Dave Hansen <dave.hansen@intel.com> wrote:

> On 7/8/22 09:54, Nadav Amit wrote:
>> Since the bug was not during MADV_DONTNEED there is no reason for the
>> results to be any different.
>> 
>> Famous last words?
> 
> Considering that your patch broke the kernel in a way that surprised us
> all, I think caution is warranted.  Re-running a microbenchmark that
> takes five minutes and stresses things a bit is the least you can do, I
> think.

I will send it later today. I was just pointing out that the failing
code-path is different from the one I measured.
Nadav Amit July 8, 2022, 6:03 p.m. UTC | #8
On Jul 8, 2022, at 10:09 AM, Nadav Amit <namit@vmware.com> wrote:

> On Jul 8, 2022, at 10:01 AM, Dave Hansen <dave.hansen@intel.com> wrote:
> 
>> On 7/8/22 09:54, Nadav Amit wrote:
>>> Since the bug was not during MADV_DONTNEED there is no reason for the
>>> results to be any different.
>>> 
>>> Famous last words?
>> 
>> Considering that your patch broke the kernel in a way that surprised us
>> all, I think caution is warranted.  Re-running a microbenchmark that
>> takes five minutes and stresses things a bit is the least you can do, I
>> think.
> 
> I will send it later today. I was just pointing out that the failing
> code-path is different from the one I measured.

It will take some more time, since 5.19 does not want to boot on my machine,
and results from VMs are meaningless for this patch. I would look into this
unrelated failure, unless you want results from 5.18.

[    6.303945] ------------[ cut here ]------------
[    6.309209] kernel BUG at arch/x86/kernel/apic/apic.c:1598!
[    6.315537] invalid opcode: 0000 [#1] PREEMPT SMP PTI
[    6.321275] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.19.0-rc4TLB+ #5
[    6.328760] Hardware name: Dell Inc. PowerEdge R630/0CNCJW, BIOS 2.9.1 12/04/2018
[    6.337236] RIP: 0010:setup_local_APIC+0x31e/0x330
[    6.342686] Code: 01 0f 85 05 ff ff ff 85 d2 7f 2b 48 8b 05 aa 37 4f 01 be 00 07 01 00 bf 50 03 00 00 48 8b 40 10 e8 37 99 fb 00 e9 04 ff ff ff <0f> 0b e8 5b 2d be 00 e9 ba 64 b8 0
[    6.363818] RSP: 0000:ffffffff82603e88 EFLAGS: 00010246
[    6.369752] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
[    6.377820] RDX: 0000000000000000 RSI: 00000000fffffeff RDI: 0000000000000020
[    6.385888] RBP: 0000000000000000 R08: 0000000000000000 R09: ffffffff82603da8
[    6.393956] R10: 0000000000000001 R11: 0000000000000001 R12: 0000000000000031
[    6.402024] R13: 0000000000000000 R14: ffffffff82613118 R15: 0000000000000000
[    6.410091] FS:  0000000000000000(0000) GS:ffff889fff600000(0000) knlGS:0000000000000000
[    6.419250] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    6.425765] CR2: ffff88c07ffff000 CR3: 000000000260c001 CR4: 00000000000606f0
[    6.433826] Call Trace:
[    6.436646]  <TASK>
[    6.439077]  ? _printk+0x53/0x6a
[    6.442777]  apic_intr_mode_init+0xd2/0xf1
[    6.447448]  x86_late_time_init+0x1b/0x2b
[    6.452019]  start_kernel+0x5d8/0x694
[    6.456194]  secondary_startup_64_no_verify+0xce/0xdb
[    6.461933]  </TASK>
[    6.464463] Modules linked in:
[    6.467979] ---[ end trace 0000000000000000 ]---
[    6.473243] RIP: 0010:setup_local_APIC+0x31e/0x330
[    6.478704] Code: 01 0f 85 05 ff ff ff 85 d2 7f 2b 48 8b 05 aa 37 4f 01 be 00 07 01 00 bf 50 03 00 00 48 8b 40 10 e8 37 99 fb 00 e9 04 ff ff ff <0f> 0b e8 5b 2d be 00 e9 ba 64 b8 0
[    6.499865] RSP: 0000:ffffffff82603e88 EFLAGS: 00010246
[    6.505803] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
[    6.513887] RDX: 0000000000000000 RSI: 00000000fffffeff RDI: 0000000000000020
[    6.521969] RBP: 0000000000000000 R08: 0000000000000000 R09: ffffffff82603da8
[    6.530053] R10: 0000000000000001 R11: 0000000000000001 R12: 0000000000000031
[    6.538136] R13: 0000000000000000 R14: ffffffff82613118 R15: 0000000000000000
[    6.546218] FS:  0000000000000000(0000) GS:ffff889fff600000(0000) knlGS:0000000000000000
[    6.555391] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    6.561919] CR2: ffff88c07ffff000 CR3: 000000000260c001 CR4: 00000000000606f0
[    6.570003] Kernel panic - not syncing: Attempted to kill the idle task!
[    6.577591] ---[ end Kernel panic - not syncing: Attempted to kill the idle task! ]---
Dave Hansen July 8, 2022, 6:54 p.m. UTC | #9
On 7/8/22 10:04, Nadav Amit wrote:
> On Jul 8, 2022, at 7:49 AM, Dave Hansen <dave.hansen@intel.com> wrote:
>> On 7/7/22 17:30, Nadav Amit wrote:
>> You might want to fix the clock on the system from which you sent this.
>> I was really scratching my head trying to figure out how you got this
>> patch out before Hugh's bug report.
>>
>>> From: Nadav Amit <namit@vmware.com>
>>>
>>> Commit aa44284960d5 ("x86/mm/tlb: Avoid reading mm_tlb_gen when
>>> possible") introduced an optimization of skipping the flush if the TLB
>>> generation that is flushed (as provided in flush_tlb_info) was already
>>> flushed.
>>>
>>> However, arch_tlbbatch_flush() does not provide any generation in
>>> flush_tlb_info. As a result, try_to_unmap_one() would not perform any
>>> TLB flushes.
>>>
>>> Fix it by checking whether f->new_tlb_gen is nonzero. A zero value is
>>> anyhow an invalid generation value.
>>
>> It is, but the check below uses 'f->end == TLB_FLUSH_ALL' as the marker
>> for f->new_tlb_gen being invalid.  Being consistent seems like a good
>> idea on this stuff.
> 
> If we get a request to do a flush that we have logically already done,
> regardless of whether it is full or partial, there is no reason to do it.
> 
> I therefore do not see a reason to look at f->end. I think that looking
> at the generation is very intuitive. If you want, I can add a constant
> such as TLB_GENERATION_INVALID.

That's a good point.

But, _my_ point was that there was only really one read site of
f->new_tlb_gen in flush_tlb_func().  That site is guarded by the "f->end
!= TLB_FLUSH_ALL" check which prevented it from making the same error
that your patch did.

Whatever we do, it would be nice to have a *single* way to check for
"does f->new_tlb_gen have an actual, valid bit of tlb gen data in it?"

Using something like TLB_GENERATION_INVALID seems reasonable to me.
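
For concreteness, the direction being discussed could look roughly like the
sketch below. This is illustrative only (the v2 patch had not been posted at
this point), but it shows the idea: name the "no generation" value once, use
it in the single check, and spell it out at the call sites that have no
generation to pass.

/* arch/x86/include/asm/tlbflush.h (sketch) */
#define TLB_GENERATION_INVALID	0

/* arch/x86/mm/tlb.c, flush_tlb_func() (sketch) */
	if (unlikely(f->new_tlb_gen != TLB_GENERATION_INVALID &&
		     f->new_tlb_gen <= local_tlb_gen)) {
		/* Already flushed up to f->new_tlb_gen; nothing to do. */
		goto done;
	}

/*
 * arch/x86/mm/tlb.c, arch_tlbbatch_flush() (sketch): say "no generation"
 * explicitly rather than passing a bare 0.
 */
	info = get_flush_tlb_info(NULL, 0, TLB_FLUSH_ALL, 0, false,
				  TLB_GENERATION_INVALID);
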
Hugh Dickins July 8, 2022, 7:21 p.m. UTC | #10
On Thu, 7 Jul 2022, Nadav Amit wrote:

> From: Nadav Amit <namit@vmware.com>
> 
> Commit aa44284960d5 ("x86/mm/tlb: Avoid reading mm_tlb_gen when
> possible") introduced an optimization of skipping the flush if the TLB
> generation that is flushed (as provided in flush_tlb_info) was already
> flushed.
> 
> However, arch_tlbbatch_flush() does not provide any generation in
> flush_tlb_info. As a result, try_to_unmap_one() would not perform any
> TLB flushes.
> 
> Fix it by checking whether f->new_tlb_gen is nonzero. A zero value is
> anyhow an invalid generation value.
> 
> In addition, add the missing unlikely() and jump to get tracing right.
> 
> Fixes: aa44284960d5 ("x86/mm/tlb: Avoid reading mm_tlb_gen when possible")
> Reported-by: Hugh Dickins <hughd@google.com>
> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
> Cc: Andy Lutomirski <luto@kernel.org>
> Signed-off-by: Nadav Amit <namit@vmware.com>

Thanks a lot for your rapid response and thinking it through
(before I got around to any "nopcid" or "nopti" experiments).

I've been testing this one for a few hours now, and no problems seen.
I expect you'll be sending another version, maybe next week, meeting
Dave's concerns; but wanted to reassure that you have correctly
identified the issue and fixed it with this - thanks.

Hugh

> ---
>  arch/x86/mm/tlb.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
> index d9314cc8b81f..d81b4084bb8a 100644
> --- a/arch/x86/mm/tlb.c
> +++ b/arch/x86/mm/tlb.c
> @@ -771,14 +771,14 @@ static void flush_tlb_func(void *info)
>  		return;
>  	}
>  
> -	if (f->new_tlb_gen <= local_tlb_gen) {
> +	if (unlikely(f->new_tlb_gen != 0 && f->new_tlb_gen <= local_tlb_gen)) {
>  		/*
>  		 * The TLB is already up to date in respect to f->new_tlb_gen.
>  		 * While the core might be still behind mm_tlb_gen, checking
>  		 * mm_tlb_gen unnecessarily would have negative caching effects
>  		 * so avoid it.
>  		 */
> -		return;
> +		goto done;
>  	}
>  
>  	/*
> -- 
> 2.25.1
Nadav Amit July 8, 2022, 8:02 p.m. UTC | #11
On Jul 8, 2022, at 12:21 PM, Hugh Dickins <hughd@google.com> wrote:

> On Thu, 7 Jul 2022, Nadav Amit wrote:
> 
>> From: Nadav Amit <namit@vmware.com>
>> 
>> Commit aa44284960d5 ("x86/mm/tlb: Avoid reading mm_tlb_gen when
>> possible") introduced an optimization of skipping the flush if the TLB
>> generation that is flushed (as provided in flush_tlb_info) was already
>> flushed.
>> 
>> However, arch_tlbbatch_flush() does not provide any generation in
>> flush_tlb_info. As a result, try_to_unmap_one() would not perform any
>> TLB flushes.
>> 
>> Fix it by checking whether f->new_tlb_gen is nonzero. A zero value is
>> anyhow an invalid generation value.
>> 
>> In addition, add the missing unlikely() and jump to get tracing right.
>> 
>> Fixes: aa44284960d5 ("x86/mm/tlb: Avoid reading mm_tlb_gen when possible")
>> Reported-by: Hugh Dickins <hughd@google.com>
>> Cc: Dave Hansen <dave.hansen@linux.intel.com>
>> Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
>> Cc: Andy Lutomirski <luto@kernel.org>
>> Signed-off-by: Nadav Amit <namit@vmware.com>
> 
> Thanks a lot for your rapid response and thinking it through
> (before I got around to any "nopcid" or "nopti" experiments).
> 
> I've been testing this one for a few hours now, and no problems seen.
> I expect you'll be sending another version, maybe next week, meeting
> Dave's concerns; but wanted to reassure that you have correctly
> identified the issue and fixed it with this - thanks.

Thanks, Hugh. Sorry again for my mistake.

Can I please have your “Tested-by”?
Hugh Dickins July 8, 2022, 8:48 p.m. UTC | #12
On Fri, 8 Jul 2022, Nadav Amit wrote:
> On Jul 8, 2022, at 12:21 PM, Hugh Dickins <hughd@google.com> wrote:
> 
> > On Thu, 7 Jul 2022, Nadav Amit wrote:
> > 
> >> From: Nadav Amit <namit@vmware.com>
> >> 
> >> Commit aa44284960d5 ("x86/mm/tlb: Avoid reading mm_tlb_gen when
> >> possible") introduced an optimization of skipping the flush if the TLB
> >> generation that is flushed (as provided in flush_tlb_info) was already
> >> flushed.
> >> 
> >> However, arch_tlbbatch_flush() does not provide any generation in
> >> flush_tlb_info. As a result, try_to_unmap_one() would not perform any
> >> TLB flushes.
> >> 
> >> Fix it by checking whether f->new_tlb_gen is nonzero. A zero value is
> >> anyhow an invalid generation value.
> >> 
> >> In addition, add the missing unlikely() and jump to get tracing right.
> >> 
> >> Fixes: aa44284960d5 ("x86/mm/tlb: Avoid reading mm_tlb_gen when possible")
> >> Reported-by: Hugh Dickins <hughd@google.com>
> >> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> >> Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
> >> Cc: Andy Lutomirski <luto@kernel.org>
> >> Signed-off-by: Nadav Amit <namit@vmware.com>
> > 
> > Thanks a lot for your rapid response and thinking it through
> > (before I got around to any "nopcid" or "nopti" experiments).
> > 
> > I've been testing this one for a few hours now, and no problems seen.
> > I expect you'll be sending another version, maybe next week, meeting
> > Dave's concerns; but wanted to reassure that you have correctly
> > identified the issue and fixed it with this - thanks.
> 
> Thanks, Hugh. Sorry again for my mistake.
> 
> Can I please have your “Tested-by”?

Sure, let me scrabble around in my box of tags, yes, here's one:

Tested-by: Hugh Dickins <hughd@google.com>
Nadav Amit July 11, 2022, 5:19 a.m. UTC | #13
On Jul 8, 2022, at 11:54 AM, Dave Hansen <dave.hansen@intel.com> wrote:

> On 7/8/22 10:04, Nadav Amit wrote:
>> On Jul 8, 2022, at 7:49 AM, Dave Hansen <dave.hansen@intel.com> wrote:
>>> On 7/7/22 17:30, Nadav Amit wrote:
>>> You might want to fix the clock on the system from which you sent this.
>>> I was really scratching my head trying to figure out how you got this
>>> patch out before Hugh's bug report.
>>> 
>>>> From: Nadav Amit <namit@vmware.com>
>>>> 
>>>> Commit aa44284960d5 ("x86/mm/tlb: Avoid reading mm_tlb_gen when
>>>> possible") introduced an optimization of skipping the flush if the TLB
>>>> generation that is flushed (as provided in flush_tlb_info) was already
>>>> flushed.
>>>> 
>>>> However, arch_tlbbatch_flush() does not provide any generation in
>>>> flush_tlb_info. As a result, try_to_unmap_one() would not perform any
>>>> TLB flushes.
>>>> 
>>>> Fix it by checking whether f->new_tlb_gen is nonzero. A zero value is
>>>> anyhow an invalid generation value.
>>> 
>>> It is, but the check below uses 'f->end == TLB_FLUSH_ALL' as the marker
>>> for f->new_tlb_gen being invalid.  Being consistent seems like a good
>>> idea on this stuff.
>> 
>> If we get a request to do a flush that we have logically already done,
>> regardless of whether it is full or partial, there is no reason to do it.
>> 
>> I therefore do not see a reason to look at f->end. I think that looking
>> at the generation is very intuitive. If you want, I can add a constant
>> such as TLB_GENERATION_INVALID.
> 
> That's a good point.
> 
> But, _my_ point was that there was only really one read site of
> f->new_tlb_gen in flush_tlb_func().  That site is guarded by the "f->end
> != TLB_FLUSH_ALL" check which prevented it from making the same error
> that your patch did.
> 
> Whatever we do, it would be nice to have a *single* way to check for
> "does f->new_tlb_gen have an actual, valid bit of tlb gen data in it?"
> 
> Using something like TLB_GENERATION_INVALID seems reasonable to me.

I am not sure that I fully understood what you meant. I understand you do
want TLB_GENERATION_INVALID, and I think you are asking for some assertions
regarding the expected relationship between TLB_GENERATION_INVALID and
TLB_FLUSH_ALL. To shorten the discussion, I’ll send v2 (very soon) and you
can comment on the patch itself.

I did run the tests again to measure performance as you asked, and the
results are similar (will-it-scale tlb_flush1/threads): an 11% speedup with
45 cores.

Without aa44284960d5:

tasks,processes,processes_idle,threads,threads_idle,linear

0,0,100,0,100,0
1,156782,97.89,157024,97.92,157024
5,707879,89.60,322015,89.69,785120
10,1311968,79.21,490881,79.68,1570240
15,1498762,68.82,553958,69.71,2355360
20,1483057,58.45,598939,60.00,3140480
25,1428105,48.07,626179,50.46,3925600
30,1648992,37.77,643954,41.36,4710720
35,1861301,27.50,671570,32.63,5495840
40,2038278,17.17,669470,24.03,6280960
45,2140412,6.71,673633,15.27,7066080

With aa44284960d5 + pending patch:

tasks,processes,processes_idle,threads,threads_idle,linear

0,0,100,0,100,0
1,157935,97.93,155351,97.93,157935
5,710450,89.60,324981,89.70,789675
10,1291935,79.21,498496,79.57,1579350
15,1515323,68.81,575821,69.68,2369025
20,1545172,58.46,625521,60.05,3158700
25,1501015,48.09,675856,50.62,3948375
30,1682308,37.80,705242,41.55,4738050
35,1850464,27.52,717724,32.33,5527725
40,2030726,17.17,734610,23.99,6317400
45,2136401,6.71,747257,16.07,7107075

Patch

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index d9314cc8b81f..d81b4084bb8a 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -771,14 +771,14 @@  static void flush_tlb_func(void *info)
 		return;
 	}
 
-	if (f->new_tlb_gen <= local_tlb_gen) {
+	if (unlikely(f->new_tlb_gen != 0 && f->new_tlb_gen <= local_tlb_gen)) {
 		/*
 		 * The TLB is already up to date in respect to f->new_tlb_gen.
 		 * While the core might be still behind mm_tlb_gen, checking
 		 * mm_tlb_gen unnecessarily would have negative caching effects
 		 * so avoid it.
 		 */
-		return;
+		goto done;
 	}
 
 	/*