Message ID | 20240307133916.3782068-2-yosryahmed@google.com (mailing list archive)
---|---
State | New
Series | x86/mm: LAM fixups and cleanups
On Thu, Mar 07, 2024 at 01:39:14PM +0000, Yosry Ahmed wrote:
> In switch_mm_irqs_off(), we read the 'mm->context.lam_cr3_mask' into
> 'new_lam', which is later passed to load_new_mm_cr3(). However, there is
> a call to set_tlbstate_lam_mode() in between which will read
> 'mm->context.lam_cr3_mask' again and set 'cpu_tlbstate.lam' accordingly.
> If we race with another thread updating 'mm->context.lam_cr3_mask', the
> value in 'cpu_tlbstate.lam' could end up being different from CR3.

What other thread? LAM can only be enabled when the process has single
thread. And cannot be disabled. See MM_CONTEXT_LOCK_LAM.

> While we are at it, remove the misguiding comment that states that
> 'new_lam' may not match tlbstate_lam_cr3_mask() if a race occurs.

The comment is indeed misguiding, but for different reason. It is leftover
from the earlier version of LAM patchset.
I know we all have different rules, but any time you could spend absorbing:

https://www.kernel.org/doc/html/next/process/maintainer-tip.html

would be appreciated, especially:

> The condensed patch description in the subject line should start with
> a uppercase letter and should be written in imperative tone.

On 3/7/24 05:39, Yosry Ahmed wrote:
> In switch_mm_irqs_off(), we read the 'mm->context.lam_cr3_mask' into
> 'new_lam', which is later passed to load_new_mm_cr3(). However, there is
> a call to set_tlbstate_lam_mode() in between which will read
> 'mm->context.lam_cr3_mask' again and set 'cpu_tlbstate.lam' accordingly.
> If we race with another thread updating 'mm->context.lam_cr3_mask', the
> value in 'cpu_tlbstate.lam' could end up being different from CR3.

Your description is fine (modulo the we's). But I slightly reworded it
to make it more plainly readable:

LAM can only be enabled when a process is single-threaded. But _kernel_
threads can temporarily use a single-threaded process's mm. That means
that a context-switching kernel thread can race and observe the mm's LAM
metadata (mm->context.lam_cr3_mask) change.

The context switch code does two logical things with that metadata:
populate CR3 and populate 'cpu_tlbstate.lam'. If it hits this race,
'cpu_tlbstate.lam' and CR3 can end up out of sync.

This de-synchronization is currently harmless. But it is confusing and
might lead to warnings or real bugs.

--

> Fix the problem by updating set_tlbstate_lam_mode() to return the LAM
> mask that was set to 'cpu_tlbstate.lam', and use that mask in
> switch_mm_irqs_off() when writing CR3. Use READ_ONCE to make sure we
> read the mask once and use it consistenly.

Spell checking is also appreciated.

...

> -static inline void set_tlbstate_lam_mode(struct mm_struct *mm)
> +static inline unsigned long set_tlbstate_lam_mode(struct mm_struct *mm)
> {
> -	this_cpu_write(cpu_tlbstate.lam,
> -		       mm->context.lam_cr3_mask >> X86_CR3_LAM_U57_BIT);
> +	unsigned long lam = READ_ONCE(mm->context.lam_cr3_mask);
> +
> +	this_cpu_write(cpu_tlbstate.lam, lam >> X86_CR3_LAM_U57_BIT);
> 	this_cpu_write(tlbstate_untag_mask, mm->context.untag_mask);
> +	return lam;
> }

The comments about races need to be _here_ so that the purpose of the
READ_ONCE() is clear.

It would also be nice to call out the rule that this can only
meaningfully be called once per context switch.

> @@ -633,7 +628,12 @@ void switch_mm_irqs_off(struct mm_struct *unused, struct mm_struct *next,
> 		barrier();
> 	}
>
> -	set_tlbstate_lam_mode(next);
> +	/*
> +	 * Even if we are not actually switching mm's, another thread could have
> +	 * updated mm->context.lam_cr3_mask. Make sure tlbstate_lam_cr3_mask()
> +	 * and the loaded CR3 use the up-to-date mask.
> +	 */

I kinda dislike how the comment talks about the details of what
set_tlbstate_lam_mode() does. It would be much better to put the meat
of this comment at the set_tlbstate_lam_mode() definition.

> +	new_lam = set_tlbstate_lam_mode(next);
> 	if (need_flush) {
> 		this_cpu_write(cpu_tlbstate.ctxs[new_asid].ctx_id, next->context.ctx_id);
> 		this_cpu_write(cpu_tlbstate.ctxs[new_asid].tlb_gen, next_tlb_gen);

This is less a complaint about your change and more of the existing
code, but I wish it was more obvious that set_tlbstate_lam_mode() is
logically shuffling data (once) from 'next' into the tlbstate.

The naming makes it sound like it is modifying the tlbstate of 'next'.
But I don't have any particularly brilliant ideas to fix it either.
Maybe just:

	/* new_lam is effectively cpu_tlbstate.lam */

> @@ -705,7 +705,6 @@ void initialize_tlbstate_and_flush(void)
>
> 	/* LAM expected to be disabled */
> 	WARN_ON(cr3 & (X86_CR3_LAM_U48 | X86_CR3_LAM_U57));
> -	WARN_ON(mm_lam_cr3_mask(mm));
>
> 	/*
> 	 * Assert that CR4.PCIDE is set if needed. (CR4.PCIDE initialization
> @@ -724,7 +723,7 @@ void initialize_tlbstate_and_flush(void)
> 	this_cpu_write(cpu_tlbstate.next_asid, 1);
> 	this_cpu_write(cpu_tlbstate.ctxs[0].ctx_id, mm->context.ctx_id);
> 	this_cpu_write(cpu_tlbstate.ctxs[0].tlb_gen, tlb_gen);
> -	set_tlbstate_lam_mode(mm);
> +	WARN_ON(set_tlbstate_lam_mode(mm));

The "set_" naming bugs me in both of the sites that get modified here.
I'd be fine with a new name that fits better, if we can think of one.
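For concreteness, here is one way the helper could carry the relocated comment Dave is asking for. The function body below is the one from the posted patch; only the comment on top (including the "once per context switch" rule) is a suggested placement and wording, not something that has been applied anywhere:

/*
 * mm->context.lam_cr3_mask can change underneath us: LAM is enabled by a
 * single-threaded process, but a kernel thread may be running with a
 * borrowed copy of that mm when it happens.  Read the mask exactly once
 * and use the same value for cpu_tlbstate.lam and for the CR3 value the
 * caller builds from the return value.  Call this only once per context
 * switch.
 */
static inline unsigned long set_tlbstate_lam_mode(struct mm_struct *mm)
{
	unsigned long lam = READ_ONCE(mm->context.lam_cr3_mask);

	this_cpu_write(cpu_tlbstate.lam, lam >> X86_CR3_LAM_U57_BIT);
	this_cpu_write(tlbstate_untag_mask, mm->context.untag_mask);
	return lam;
}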
On Thu, Mar 07, 2024, Dave Hansen wrote:
> I know we all have different rules, but any time you could spend absorbing:
>
> https://www.kernel.org/doc/html/next/process/maintainer-tip.html
>
> would be appreciated, especially:
>
> > The condensed patch description in the subject line should start with
> > a uppercase letter and should be written in imperative tone.

And while you're at it,

https://www.kernel.org/doc/html/next/process/maintainer-kvm-x86.html

which my crystal ball says will be helpful in your future. ;-)
On Thu, Mar 07, 2024 at 07:22:36PM +0200, Kirill A. Shutemov wrote:
> On Thu, Mar 07, 2024 at 01:39:14PM +0000, Yosry Ahmed wrote:
> > In switch_mm_irqs_off(), we read the 'mm->context.lam_cr3_mask' into
> > 'new_lam', which is later passed to load_new_mm_cr3(). However, there is
> > a call to set_tlbstate_lam_mode() in between which will read
> > 'mm->context.lam_cr3_mask' again and set 'cpu_tlbstate.lam' accordingly.
> > If we race with another thread updating 'mm->context.lam_cr3_mask', the
> > value in 'cpu_tlbstate.lam' could end up being different from CR3.
>
> What other thread? LAM can only be enabled when the process has single
> thread. And cannot be disabled. See MM_CONTEXT_LOCK_LAM.

Right, but a kthread may run with that single-threaded process's mm IIUC.
I think this can happen via kthread_use_mm() or if we context switch
directly from the user process to the kthread (context_switch() doesn't
seem to update the mm in this case).

> > While we are at it, remove the misguiding comment that states that
> > 'new_lam' may not match tlbstate_lam_cr3_mask() if a race occurs.
>
> The comment is indeed misguiding, but for different reason. It is leftover
> from the earlier version of LAM patchset.
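To make the borrowed-mm scenario concrete, here is a minimal illustrative kernel-thread function (not part of this series; borrow_mm_worker() and its calling convention are invented for the example) showing how kthread_use_mm() lets a kernel thread run, and context switch, with a user process's mm while that single-threaded owner is still free to enable LAM and change mm->context.lam_cr3_mask:

#include <linux/kthread.h>
#include <linux/sched/mm.h>

/* Illustrative only: assumes the caller passed an mm pinned with mmget(). */
static int borrow_mm_worker(void *data)
{
	struct mm_struct *mm = data;

	/* Adopt the user process's mm; we now context switch with it. */
	kthread_use_mm(mm);

	/*
	 * Every context switch in here runs switch_mm_irqs_off() against
	 * 'mm', re-reading mm->context.lam_cr3_mask, while the owning
	 * process may concurrently enable LAM.
	 */

	kthread_unuse_mm(mm);
	mmput(mm);
	return 0;
}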
On Thu, Mar 07, 2024 at 09:36:58AM -0800, Dave Hansen wrote:
> I know we all have different rules, but any time you could spend absorbing:
>
> https://www.kernel.org/doc/html/next/process/maintainer-tip.html

Thanks for the quick review and tips. I didn't know this existed, I will
take a look before respinning.

> would be appreciated, especially:
>
> > The condensed patch description in the subject line should start with
> > a uppercase letter and should be written in imperative tone.
>
> On 3/7/24 05:39, Yosry Ahmed wrote:
> > In switch_mm_irqs_off(), we read the 'mm->context.lam_cr3_mask' into
> > 'new_lam', which is later passed to load_new_mm_cr3(). However, there is
> > a call to set_tlbstate_lam_mode() in between which will read
> > 'mm->context.lam_cr3_mask' again and set 'cpu_tlbstate.lam' accordingly.
> > If we race with another thread updating 'mm->context.lam_cr3_mask', the
> > value in 'cpu_tlbstate.lam' could end up being different from CR3.
>
> Your description is fine (modulo the we's). But I slightly reworded it
> to make it more plainly readable:
>
> LAM can only be enabled when a process is single-threaded. But _kernel_
> threads can temporarily use a single-threaded process's mm. That means
> that a context-switching kernel thread can race and observe the mm's LAM
> metadata (mm->context.lam_cr3_mask) change.
>
> The context switch code does two logical things with that metadata:
> populate CR3 and populate 'cpu_tlbstate.lam'. If it hits this race,
> 'cpu_tlbstate.lam' and CR3 can end up out of sync.
>
> This de-synchronization is currently harmless. But it is confusing and
> might lead to warnings or real bugs.

Thanks a lot! I will adopt your version moving forward :)

> --
>
> > Fix the problem by updating set_tlbstate_lam_mode() to return the LAM
> > mask that was set to 'cpu_tlbstate.lam', and use that mask in
> > switch_mm_irqs_off() when writing CR3. Use READ_ONCE to make sure we
> > read the mask once and use it consistenly.
>
> Spell checking is also appreciated.

I did run checkpatch. Did it miss something?

> ...
> > -static inline void set_tlbstate_lam_mode(struct mm_struct *mm)
> > +static inline unsigned long set_tlbstate_lam_mode(struct mm_struct *mm)
> > {
> > -	this_cpu_write(cpu_tlbstate.lam,
> > -		       mm->context.lam_cr3_mask >> X86_CR3_LAM_U57_BIT);
> > +	unsigned long lam = READ_ONCE(mm->context.lam_cr3_mask);
> > +
> > +	this_cpu_write(cpu_tlbstate.lam, lam >> X86_CR3_LAM_U57_BIT);
> > 	this_cpu_write(tlbstate_untag_mask, mm->context.untag_mask);
> > +	return lam;
> > }
>
> The comments about races need to be _here_ so that the purpose of the
> READ_ONCE() is clear.
>
> It would also be nice to call out the rule that this can only
> meaningfully be called once per context switch.

I wanted the comments in switch_mm_irqs_off() where the races actually
matter, but I guess I can make the comment more generic and specify that
the return value is used to write CR3 so the READ_ONCE keeps CR3 and
tlbstate.lam consistent.

> > @@ -633,7 +628,12 @@ void switch_mm_irqs_off(struct mm_struct *unused, struct mm_struct *next,
> > 		barrier();
> > 	}
> >
> > -	set_tlbstate_lam_mode(next);
> > +	/*
> > +	 * Even if we are not actually switching mm's, another thread could have
> > +	 * updated mm->context.lam_cr3_mask. Make sure tlbstate_lam_cr3_mask()
> > +	 * and the loaded CR3 use the up-to-date mask.
> > +	 */
>
> I kinda dislike how the comment talks about the details of what
> set_tlbstate_lam_mode() does. It would be much better to put the meat
> of this comment at the set_tlbstate_lam_mode() definition.

Agreed. I will move most comments to set_tlbstate_lam_mode().

> > +	new_lam = set_tlbstate_lam_mode(next);
> > 	if (need_flush) {
> > 		this_cpu_write(cpu_tlbstate.ctxs[new_asid].ctx_id, next->context.ctx_id);
> > 		this_cpu_write(cpu_tlbstate.ctxs[new_asid].tlb_gen, next_tlb_gen);
>
> This is less a complaint about your change and more of the existing
> code, but I wish it was more obvious that set_tlbstate_lam_mode() is
> logically shuffling data (once) from 'next' into the tlbstate.
>
> The naming makes it sound like it is modifying the tlbstate of 'next'.

We can update the function name to make it more verbose, maybe something
like update_cpu_tlbstate_lam()? We can also try to put "return" somewhere
in the name to imply that it returns the LAM mask it sets, but I can't
make that look pretty.

> But I don't have any particularly brilliant ideas to fix it either.
> Maybe just:
>
> 	/* new_lam is effectively cpu_tlbstate.lam */
>
> > @@ -705,7 +705,6 @@ void initialize_tlbstate_and_flush(void)
> >
> > 	/* LAM expected to be disabled */
> > 	WARN_ON(cr3 & (X86_CR3_LAM_U48 | X86_CR3_LAM_U57));
> > -	WARN_ON(mm_lam_cr3_mask(mm));
> >
> > 	/*
> > 	 * Assert that CR4.PCIDE is set if needed. (CR4.PCIDE initialization
> > @@ -724,7 +723,7 @@ void initialize_tlbstate_and_flush(void)
> > 	this_cpu_write(cpu_tlbstate.next_asid, 1);
> > 	this_cpu_write(cpu_tlbstate.ctxs[0].ctx_id, mm->context.ctx_id);
> > 	this_cpu_write(cpu_tlbstate.ctxs[0].tlb_gen, tlb_gen);
> > -	set_tlbstate_lam_mode(mm);
> > +	WARN_ON(set_tlbstate_lam_mode(mm));
>
> The "set_" naming bugs me in both of the sites that get modified here.
> I'd be fine with a new name that fits better, if we can think of one.

Is it because it's not clear we are updating cpu_tlbstate (in which case
I think update_cpu_tlbstate_lam() is an improvement), or is it because
the function returns a value now? If the latter we can put "return" in
the name somewhere, or keep the function void and pass in an output
parameter.
On Thu, Mar 07, 2024 at 10:49:51AM -0800, Sean Christopherson wrote:
> On Thu, Mar 07, 2024, Dave Hansen wrote:
> > I know we all have different rules, but any time you could spend absorbing:
> >
> > https://www.kernel.org/doc/html/next/process/maintainer-tip.html
> >
> > would be appreciated, especially:
> >
> > > The condensed patch description in the subject line should start with
> > > a uppercase letter and should be written in imperative tone.
>
> And while you're at it,
>
> https://www.kernel.org/doc/html/next/process/maintainer-kvm-x86.html
>
> which my crystal ball says will be helpful in your future. ;-)

I am sincerely hoping that there are no contradictions between different
maintainer docs, otherwise my brain will break really fast.
On Thu, Mar 07, 2024, Yosry Ahmed wrote:
> On Thu, Mar 07, 2024 at 10:49:51AM -0800, Sean Christopherson wrote:
> > On Thu, Mar 07, 2024, Dave Hansen wrote:
> > > I know we all have different rules, but any time you could spend absorbing:
> > >
> > > https://www.kernel.org/doc/html/next/process/maintainer-tip.html
> > >
> > > would be appreciated, especially:
> > >
> > > > The condensed patch description in the subject line should start with
> > > > a uppercase letter and should be written in imperative tone.
> >
> > And while you're at it,
> >
> > https://www.kernel.org/doc/html/next/process/maintainer-kvm-x86.html
> >
> > which my crystal ball says will be helpful in your future. ;-)
>
> I am sincerely hoping that there are no contradictions between different
> maintainer docs, otherwise my brain will break really fast.

Heh, as long as you don't contribute to net/, you'll be fine. kvm-x86
mostly just follows tip, with a few minor exceptions. But the exceptions
are explicitly documented in kvm-x86, and I am intentionally
forgiving/lax on them, because yeah, getting pulled in multiple
directions is no fun.
> > The "set_" naming bugs me in both of the sites that get modified here.
> > I'd be fine with a new name that fits better, if we can think of one.
>
> Is it because it's not clear we are updating cpu_tlbstate (in which case
> I think update_cpu_tlbstate_lam() is an improvement), or is it because
> the function returns a value now? If the latter we can put "return" in
> the name somewhere, or keep the function void and pass in an output
> parameter.

Or we can avoid returning a value from the helper and avoid passing an
mm. The callers would be more verbose, they'll have to call
mm_lam_cr3_mask() and mm_untag_mask() and pass the results into the
helper (set_tlbstate_lam_mode() or update_cpu_tlbstate_lam()). Another
advantage of this is that we can move the READ_ONCE to
switch_mm_irqs_off() and keep the comment here.

WDYT?
On 3/7/24 15:21, Yosry Ahmed wrote:
>>> The "set_" naming bugs me in both of the sites that get modified here.
>>> I'd be fine with a new name that fits better, if we can think of one.
>> Is it because it's not clear we are updating cpu_tlbstate (in which case
>> I think update_cpu_tlbstate_lam() is an improvement), or is it because
>> the function returns a value now?

That's part of it.

>> If the latter we can put "return" in the name somewhere, or keep
>> the function void and pass in an output parameter.

No, adding a "_return" won't make it better.

> Or we can avoid returning a value from the helper and avoid passing an
> mm. The callers would be more verbose, they'll have to call
> mm_lam_cr3_mask() and mm_untag_mask() and pass the results into the
> helper (set_tlbstate_lam_mode() or update_cpu_tlbstate_lam()). Another
> advantage of this is that we can move the READ_ONCE to
> switch_mm_irqs_off() and keep the comment here.

One thing I don't like about set_tlbstate_lam_mode() is that it's not
obvious that it's writing to "cpu_tlbstate" and it's right smack in the
middle of a bunch of other writers to the same thing.

But I'm also not sure that open-coding it at its three call sites makes
things better overall.

I honestly don't have any brilliant suggestions.
> > Or we can avoid returning a value from the helper and avoid passing an
> > mm. The callers would be more verbose, they'll have to call
> > mm_lam_cr3_mask() and mm_untag_mask() and pass the results into the
> > helper (set_tlbstate_lam_mode() or update_cpu_tlbstate_lam()). Another
> > advantage of this is that we can move the READ_ONCE to
> > switch_mm_irqs_off() and keep the comment here.
>
> One thing I don't like about set_tlbstate_lam_mode() is that it's not
> obvious that it's writing to "cpu_tlbstate" and it's right smack in the
> middle of a bunch of other writers to the same thing.
>
> But I'm also not sure that open-coding it at its three call sites makes
> things better overall.
>
> I honestly don't have any brilliant suggestions.

Let me ponder this a little bit and try to come up with something; I
think a mix of renaming and open-coding could make an improvement.

Thanks!
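As a purely illustrative sketch of that direction (the name update_cpu_tlbstate_lam() is the one floated above; the exact signature is hypothetical and nothing of this shape has been posted), the helper could take values instead of an mm, so that the single READ_ONCE() and the comment explaining it live at the call site:

/* The caller decides how (and how often) the LAM mask was read. */
static inline void update_cpu_tlbstate_lam(unsigned long lam_cr3_mask,
					   u64 untag_mask)
{
	this_cpu_write(cpu_tlbstate.lam, lam_cr3_mask >> X86_CR3_LAM_U57_BIT);
	this_cpu_write(tlbstate_untag_mask, untag_mask);
}

switch_mm_irqs_off() would then read the mask once, e.g. new_lam = READ_ONCE(next->context.lam_cr3_mask), call update_cpu_tlbstate_lam(new_lam, mm_untag_mask(next)), and pass the same new_lam to load_new_mm_cr3().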
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 25726893c6f4d..a4ddb20f84fe7 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -399,11 +399,13 @@ static inline u64 tlbstate_lam_cr3_mask(void)
 	return lam << X86_CR3_LAM_U57_BIT;
 }
 
-static inline void set_tlbstate_lam_mode(struct mm_struct *mm)
+static inline unsigned long set_tlbstate_lam_mode(struct mm_struct *mm)
 {
-	this_cpu_write(cpu_tlbstate.lam,
-		       mm->context.lam_cr3_mask >> X86_CR3_LAM_U57_BIT);
+	unsigned long lam = READ_ONCE(mm->context.lam_cr3_mask);
+
+	this_cpu_write(cpu_tlbstate.lam, lam >> X86_CR3_LAM_U57_BIT);
 	this_cpu_write(tlbstate_untag_mask, mm->context.untag_mask);
+	return lam;
 }
 
 #else
@@ -413,8 +415,9 @@ static inline u64 tlbstate_lam_cr3_mask(void)
 	return 0;
 }
 
-static inline void set_tlbstate_lam_mode(struct mm_struct *mm)
+static inline unsigned long set_tlbstate_lam_mode(struct mm_struct *mm)
 {
+	return 0;
 }
 #endif
 #endif /* !MODULE */
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 51f9f56941058..2975d3f89a5de 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -503,9 +503,9 @@ void switch_mm_irqs_off(struct mm_struct *unused, struct mm_struct *next,
 {
 	struct mm_struct *prev = this_cpu_read(cpu_tlbstate.loaded_mm);
 	u16 prev_asid = this_cpu_read(cpu_tlbstate.loaded_mm_asid);
-	unsigned long new_lam = mm_lam_cr3_mask(next);
 	bool was_lazy = this_cpu_read(cpu_tlbstate_shared.is_lazy);
 	unsigned cpu = smp_processor_id();
+	unsigned long new_lam;
 	u64 next_tlb_gen;
 	bool need_flush;
 	u16 new_asid;
@@ -561,11 +561,6 @@ void switch_mm_irqs_off(struct mm_struct *unused, struct mm_struct *next,
 		VM_WARN_ON(this_cpu_read(cpu_tlbstate.ctxs[prev_asid].ctx_id) !=
 			   next->context.ctx_id);
 
-		/*
-		 * If this races with another thread that enables lam, 'new_lam'
-		 * might not match tlbstate_lam_cr3_mask().
-		 */
-
 		/*
 		 * Even in lazy TLB mode, the CPU should stay set in the
 		 * mm_cpumask. The TLB shootdown code can figure out from
@@ -633,7 +628,12 @@ void switch_mm_irqs_off(struct mm_struct *unused, struct mm_struct *next,
 		barrier();
 	}
 
-	set_tlbstate_lam_mode(next);
+	/*
+	 * Even if we are not actually switching mm's, another thread could have
+	 * updated mm->context.lam_cr3_mask. Make sure tlbstate_lam_cr3_mask()
+	 * and the loaded CR3 use the up-to-date mask.
+	 */
+	new_lam = set_tlbstate_lam_mode(next);
 	if (need_flush) {
 		this_cpu_write(cpu_tlbstate.ctxs[new_asid].ctx_id, next->context.ctx_id);
 		this_cpu_write(cpu_tlbstate.ctxs[new_asid].tlb_gen, next_tlb_gen);
@@ -705,7 +705,6 @@ void initialize_tlbstate_and_flush(void)
 
 	/* LAM expected to be disabled */
 	WARN_ON(cr3 & (X86_CR3_LAM_U48 | X86_CR3_LAM_U57));
-	WARN_ON(mm_lam_cr3_mask(mm));
 
 	/*
 	 * Assert that CR4.PCIDE is set if needed. (CR4.PCIDE initialization
@@ -724,7 +723,7 @@ void initialize_tlbstate_and_flush(void)
 	this_cpu_write(cpu_tlbstate.next_asid, 1);
 	this_cpu_write(cpu_tlbstate.ctxs[0].ctx_id, mm->context.ctx_id);
 	this_cpu_write(cpu_tlbstate.ctxs[0].tlb_gen, tlb_gen);
-	set_tlbstate_lam_mode(mm);
+	WARN_ON(set_tlbstate_lam_mode(mm));
 
 	for (i = 1; i < TLB_NR_DYN_ASIDS; i++)
 		this_cpu_write(cpu_tlbstate.ctxs[i].ctx_id, 0);
In switch_mm_irqs_off(), we read the 'mm->context.lam_cr3_mask' into
'new_lam', which is later passed to load_new_mm_cr3(). However, there is
a call to set_tlbstate_lam_mode() in between which will read
'mm->context.lam_cr3_mask' again and set 'cpu_tlbstate.lam' accordingly.
If we race with another thread updating 'mm->context.lam_cr3_mask', the
value in 'cpu_tlbstate.lam' could end up being different from CR3.

Fix the problem by updating set_tlbstate_lam_mode() to return the LAM
mask that was set to 'cpu_tlbstate.lam', and use that mask in
switch_mm_irqs_off() when writing CR3. Use READ_ONCE to make sure we
read the mask once and use it consistenly.

No practical problems have been observed from this, but it's a recipe
for future problems (e.g. debug warnings in switch_mm_irqs_off() or
__get_current_cr3_fast() could fire). It is unlikely that this can cause
any real issues since only a single-threaded process can update its own
LAM mask, so the race here could happen when context switching between
kthreads using a borrowed MM. In this case, it's unlikely that LAM is
important. If it is, then we would need to ensure all CPUs using the mm
are updated before returning to userspace when LAM is enabled -- but we
don't do that.

While we are at it, remove the misguiding comment that states that
'new_lam' may not match tlbstate_lam_cr3_mask() if a race occurs. This
can happen without a race, a different thread could have just enabled
LAM since the last context switch on the current CPU. Replace it with a
hopefully clearer comment above set_tlbstate_lam_mode().

Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
---
 arch/x86/include/asm/tlbflush.h | 11 +++++++----
 arch/x86/mm/tlb.c               | 17 ++++++++---------
 2 files changed, 15 insertions(+), 13 deletions(-)
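For readers wondering what kind of "debug warnings" the changelog has in mind: a consistency check along the lines sketched below (a paraphrase with an invented function name, not the exact code of __get_current_cr3_fast()) would fire if cpu_tlbstate.lam and the LAM bits in CR3 came from two different reads of mm->context.lam_cr3_mask:

/* Hypothetical illustration of a tlbstate-vs-CR3 consistency check. */
static void check_cr3_matches_tlbstate(void)
{
	unsigned long cr3 = build_cr3(this_cpu_read(cpu_tlbstate.loaded_mm)->pgd,
				      this_cpu_read(cpu_tlbstate.loaded_mm_asid),
				      tlbstate_lam_cr3_mask());

	/* Fires if the cached LAM mode and the live CR3 disagree. */
	VM_WARN_ON_ONCE(cr3 != __read_cr3());
}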