
[RFC,2/3] x86/mm: make sure LAM is up-to-date during context switching

Message ID 20240307133916.3782068-3-yosryahmed@google.com (mailing list archive)
State New
Series x86/mm: LAM fixups and cleanups

Commit Message

Yosry Ahmed March 7, 2024, 1:39 p.m. UTC
During context switching, if we are not switching to new mm and no TLB
flush is needed, we do not write CR3. However, it is possible that a
user thread enables LAM while a kthread is running on a different CPU
with the old LAM CR3 mask. If the kthread context switches into any
thread of that user process, it may not write CR3 with the new LAM mask,
which would cause the user thread to run with a misconfigured CR3 that
disables LAM on the CPU.

Fix this by making sure we write a new CR3 if LAM is not up-to-date. No
problems were observed in practice, this was found by code inspection.

Note that it is possible that mm->context.lam_cr3_mask changes throughout
switch_mm_irqs_off(). But since LAM can only be enabled by a
single-threaded process on its own behalf, in that case we cannot be
switching to a user thread in that same process, we can only be
switching to another kthread using the borrowed mm or a different user
process, which should be fine.

Fixes: 82721d8b25d7 ("x86/mm: Handle LAM on context switch")
Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
---
 arch/x86/mm/tlb.c | 50 ++++++++++++++++++++++++++++-------------------
 1 file changed, 30 insertions(+), 20 deletions(-)

Comments

Dave Hansen March 7, 2024, 3:29 p.m. UTC | #1
On 3/7/24 05:39, Yosry Ahmed wrote:
> -		/*
> -		 * Read the tlb_gen to check whether a flush is needed.
> -		 * If the TLB is up to date, just use it.
> -		 * The barrier synchronizes with the tlb_gen increment in
> -		 * the TLB shootdown code.
> -		 */
> -		smp_mb();
> -		next_tlb_gen = atomic64_read(&next->context.tlb_gen);
> -		if (this_cpu_read(cpu_tlbstate.ctxs[prev_asid].tlb_gen) ==
> -				next_tlb_gen)
> +		if (!need_flush && !need_lam_update)
>  			return;

Instead of all this new complexity, why not just inc_mm_tlb_gen() at the
site where LAM is enabled?  That should signal to any CPU thread that
its TLB is out of date and it needs to do a full CR3 reload.

Also, have you been able to actually catch this scenario in practice?
Considering how fun this code path is, a little effort at an actual
reproduction would be really appreciated.
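
For reference, a minimal sketch of where the suggested inc_mm_tlb_gen()
could go (hypothetical placement; the surrounding three lines are from
prctl_enable_tagged_addr() as quoted later in this thread, and this is
not what the series does):

	write_cr3(__read_cr3() | mm->context.lam_cr3_mask);
	set_tlbstate_lam_mode(mm);
	set_bit(MM_CONTEXT_LOCK_LAM, &mm->context.flags);
	/*
	 * Assumption per the suggestion above: bumping the mm's TLB
	 * generation makes CPUs that lazily hold this mm see a stale
	 * tlb_gen in switch_mm_irqs_off() and do a full CR3 reload,
	 * picking up the new LAM bits as a side effect.
	 */
	inc_mm_tlb_gen(mm);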
Kirill A . Shutemov March 7, 2024, 5:29 p.m. UTC | #2
On Thu, Mar 07, 2024 at 01:39:15PM +0000, Yosry Ahmed wrote:
> During context switching, if we are not switching to new mm and no TLB
> flush is needed, we do not write CR3. However, it is possible that a
> user thread enables LAM while a kthread is running on a different CPU
> with the old LAM CR3 mask. If the kthread context switches into any
> thread of that user process, it may not write CR3 with the new LAM mask,
> which would cause the user thread to run with a misconfigured CR3 that
> disables LAM on the CPU.

I don't think it is possible. As I said, we can only enable LAM when the
process has a single thread. If it enables LAM concurrently with a kernel
thread, and the kernel thread gets control on the same CPU after the
userspace thread of the same process, LAM is already going to be enabled.
No need for special handling.
Dave Hansen March 7, 2024, 5:56 p.m. UTC | #3
On 3/7/24 09:29, Kirill A. Shutemov wrote:
> On Thu, Mar 07, 2024 at 01:39:15PM +0000, Yosry Ahmed wrote:
>> During context switching, if we are not switching to new mm and no TLB
>> flush is needed, we do not write CR3. However, it is possible that a
>> user thread enables LAM while a kthread is running on a different CPU
>> with the old LAM CR3 mask. If the kthread context switches into any
>> thread of that user process, it may not write CR3 with the new LAM mask,
>> which would cause the user thread to run with a misconfigured CR3 that
>> disables LAM on the CPU.
> I don't think it is possible. As I said, we can only enable LAM when the
> process has a single thread. If it enables LAM concurrently with a kernel
> thread, and the kernel thread gets control on the same CPU after the
> userspace thread of the same process, LAM is already going to be enabled.
> No need for special handling.

I think it's something logically like this:

						// main thread
	kthread_use_mm()
	cr3 |= mm->lam_cr3_mask;
						mm->lam_cr3_mask = foo;
	cpu_tlbstate.lam = mm->lam_cr3_mask;

Obviously the kthread's LAM state is going to be random.  It's
fundamentally racing with the enabling thread.  That part is fine.

The main pickle is the fact that CR3 and cpu_tlbstate.lam are out of
sync.  That seems worth fixing.

Or is there something else that keeps this whole thing from racing in
the first place?
Yosry Ahmed March 7, 2024, 9:04 p.m. UTC | #4
On Thu, Mar 07, 2024 at 07:29:34AM -0800, Dave Hansen wrote:
> On 3/7/24 05:39, Yosry Ahmed wrote:
> > -		/*
> > -		 * Read the tlb_gen to check whether a flush is needed.
> > -		 * If the TLB is up to date, just use it.
> > -		 * The barrier synchronizes with the tlb_gen increment in
> > -		 * the TLB shootdown code.
> > -		 */
> > -		smp_mb();
> > -		next_tlb_gen = atomic64_read(&next->context.tlb_gen);
> > -		if (this_cpu_read(cpu_tlbstate.ctxs[prev_asid].tlb_gen) ==
> > -				next_tlb_gen)
> > +		if (!need_flush && !need_lam_update)
> >  			return;
> 
> Instead of all this new complexity, why not just inc_mm_tlb_gen() at the
> site where LAM is enabled?  That should signal to any CPU thread that
> its TLB is out of date and it needs to do a full CR3 reload.

It's not really a lot of complexity imo; most of the patch is reversing
the if conditions that return when a TLB flush is not needed into a
single if block that sets need_flush to true. I can split this into a
different patch if it helps review, or I can do something like this to
keep the diff concise:

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 2975d3f89a5de..17ab105522287 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -576,7 +576,7 @@ void switch_mm_irqs_off(struct mm_struct *unused, struct mm_struct *next,
                 * process. No TLB flush required.
                 */
                if (!was_lazy)
-                       return;
+                       goto check_return;

                /*
                 * Read the tlb_gen to check whether a flush is needed.
@@ -588,7 +588,7 @@ void switch_mm_irqs_off(struct mm_struct *unused, struct mm_struct *next,
                next_tlb_gen = atomic64_read(&next->context.tlb_gen);
                if (this_cpu_read(cpu_tlbstate.ctxs[prev_asid].tlb_gen) ==
                                next_tlb_gen)
-                       return;
+                       goto check_return;

                /*
                 * TLB contents went out of date while we were in lazy
@@ -596,6 +596,9 @@ void switch_mm_irqs_off(struct mm_struct *unused, struct mm_struct *next,
                 */
                new_asid = prev_asid;
                need_flush = true;
+check_return:
+               if (!need_flush && /* LAM up-to-date */)
+                       return;
        } else {
                /*
                 * Apply process to process speculation vulnerability

..but I prefer the current patch though to avoid the goto. I think the
flow is easier to follow.

I thought about doing inc_mm_tlb_gen() when LAM is updated, but it felt
hacky and more importantly doesn't make it clear in switch_mm_irqs_off()
that we correctly handle LAM updates. We can certainly add a comment,
but I think an explicit check for CPU LAM vs. mm LAM is much clearer.

WDYT?

> 
> Also, have you been able to actually catch this scenario in practice?

Nope, by code inspection (as I admitted in the commit log).

> Considering how fun this code path is, a little effort at an actual
> reproduction would be really appreciated.

I tried reproducing it but gave up quickly. We need a certain sequence
of events to happen:

CPU 1					CPU 2
kthread_use_mm()
					/* user thread enables LAM */
					context_switch()
context_switch() /* to user thread */


IIUC we don't really need kthread_use_mm(), we need the kthread to be
scheduled on the CPU directly after the user thread, so maybe something
like:

CPU 1					CPU 2
/* user thread running */
context_switch() /* to kthread */
					/* user thread enables LAM */
					context_switch()
context_switch() /* to user thread */

It doesn't seem easy to reproduce. WDYT?
Yosry Ahmed March 7, 2024, 9:08 p.m. UTC | #5
On Thu, Mar 07, 2024 at 09:56:07AM -0800, Dave Hansen wrote:
> On 3/7/24 09:29, Kirill A. Shutemov wrote:
> > On Thu, Mar 07, 2024 at 01:39:15PM +0000, Yosry Ahmed wrote:
> >> During context switching, if we are not switching to new mm and no TLB
> >> flush is needed, we do not write CR3. However, it is possible that a
> >> user thread enables LAM while a kthread is running on a different CPU
> >> with the old LAM CR3 mask. If the kthread context switches into any
> >> thread of that user process, it may not write CR3 with the new LAM mask,
> >> which would cause the user thread to run with a misconfigured CR3 that
> >> disables LAM on the CPU.
> > I don't think it is possible. As I said, we can only enable LAM when the
> > process has a single thread. If it enables LAM concurrently with a kernel
> > thread, and the kernel thread gets control on the same CPU after the
> > userspace thread of the same process, LAM is already going to be enabled.
> > No need for special handling.
> 
> I think it's something logically like this:
> 
> 						// main thread
> 	kthread_use_mm()
> 	cr3 |= mm->lam_cr3_mask;
> 						mm->lam_cr3_mask = foo;
> 	cpu_tlbstate.lam = mm->lam_cr3_mask;

IIUC it doesn't have to be through kthread_use_mm(). If we context
switch directly from the user thread to a kthread, the kthread will keep
using the user thread's mm AFAICT.

> 
> Obviously the kthread's LAM state is going to be random.  It's
> fundamentally racing with the enabling thread.  That part is fine.
> 
> The main pickle is the fact that CR3 and cpu_tlbstate.lam are out of
> sync.  That seems worth fixing.

That's what is fixed by patch 1, specifically a race between
switch_mm_irqs_off() and LAM being enabled. This patch is fixing a
different problem:

CPU 1                                   CPU 2
/* user thread running */
context_switch() /* to kthread */
                                        /* user thread enables LAM */
                                        context_switch()
context_switch() /* to user thread */

In this case, there are no races, but the second context switch on CPU 1
may not write CR3 (if TLB is up-to-date), in which case we will run the
user thread with CR3 having the wrong LAM mask. This could cause bigger
problems, right?

> 
> Or is there something else that keeps this whole thing from racing in
> the first place?

+1 that would be good to know, but I didn't find anything.
Dave Hansen March 7, 2024, 9:39 p.m. UTC | #6
On 3/7/24 13:04, Yosry Ahmed wrote:
> I thought about doing inc_mm_tlb_gen() when LAM is updated, but it felt
> hacky and more importantly doesn't make it clear in switch_mm_irqs_off()
> that we correctly handle LAM updates. We can certainly add a comment,
> but I think an explicit check for CPU LAM vs. mm LAM is much clearer.
> 
> WDYT?

The mm generations are literally there so that if the mm changes,
all the CPUs know they need an update.  Changing LAM enabling is 100%
consistent with telling other CPUs that they need an update.

I'd be curious if Andy feels differently though.

>> Considering how fun this code path is, a little effort at an actual
>> reproduction would be really appreciated.
> 
> I tried reproducing it but gave up quickly. We need a certain sequence
> of events to happen:
> 
> CPU 1					CPU 2
> kthread_use_mm()
> 					/* user thread enables LAM */
> 					context_switch()
> context_switch() /* to user thread */

First, it would be fine to either create a new kthread for reproduction
purposes or to hack an existing one.  For instance, have the LAM
prctl() take an extra ref on the mm and stick it in a global variable:

	mmgrab(current->mm);
	global_mm = current->mm;

Then in the kthread, grab the mm and use it:

	while (!global_mm);
	kthread_use_mm(global_mm);
	... check for the race
	mmdrop(global_mm);

You can also hackily wait for thread to move with a stupid spin loop:

	while (smp_processor_id() != 1);

and then actually move it with sched_setaffinity() from userspace.  That
can make it easier to get that series of events to happen in lockstep.
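
For the userspace half, a small sketch of moving the thread with
sched_setaffinity() (a generic helper, not part of the series or the LAM
selftest):

	#define _GNU_SOURCE
	#include <sched.h>

	/* Pin the calling thread to a single CPU; returns 0 on success. */
	static int pin_to_cpu(int cpu)
	{
		cpu_set_t set;

		CPU_ZERO(&set);
		CPU_SET(cpu, &set);
		return sched_setaffinity(0, sizeof(set), &set);
	}

e.g. pin_to_cpu(0), invoke the LAM prctl, then pin_to_cpu(1) to force the
migration that the in-kernel spin loop above is waiting for.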
Dave Hansen March 7, 2024, 9:48 p.m. UTC | #7
On 3/7/24 13:08, Yosry Ahmed wrote:
> CPU 1                                   CPU 2
> /* user thread running */
> context_switch() /* to kthread */
>                                         /* user thread enables LAM */
>                                         context_switch()
> context_switch() /* to user thread */
> 
> In this case, there are no races, but the second context switch on CPU 1
> may not write CR3 (if TLB is up-to-date), in which case we will run the
> user thread with CR3 having the wrong LAM mask. This could cause bigger
> problems, right?

Yes, but this is precisely the kind of thing that we should solve with
mm generations.  Their sole purpose is to thwart the optimization where
mm_prev == mm_next and we might not rewrite CR3.
Yosry Ahmed March 7, 2024, 10:29 p.m. UTC | #8
On Thu, Mar 07, 2024 at 01:39:53PM -0800, Dave Hansen wrote:
> On 3/7/24 13:04, Yosry Ahmed wrote:
> > I thought about doing inc_mm_tlb_gen() when LAM is updated, but it felt
> > hacky and more importantly doesn't make it clear in switch_mm_irqs_off()
> > that we correctly handle LAM updates. We can certainly add a comment,
> > but I think an explicit check for CPU LAM vs. mm LAM is much clearer.
> > 
> > WDYT?
> 
> The mm generations are literally there so that if the mm changes,
> all the CPUs know they need an update.  Changing LAM enabling is 100%
> consistent with telling other CPUs that they need an update.
> 
> I'd be curious if Andy feels differently though.

The mm generations are TLB-specific and all the code using them assumes
as much (e.g. look at the checks in switch_mm_irqs_off() when prev ==
next). We can go around and update comments and/or function names to
make them more generic, but this seems excessive. If we don't, the code
becomes less clear imo.

I agree that the use case here is essentially the same (let other
CPUs know they need to write CR3), but I still think that since the LAM
case is just a simple one-time enablement, an explicit check in
switch_mm_irqs_off() would be clearer.

Just my 2c, let me know what you prefer :)

> 
> >> Considering how fun this code path is, a little effort at an actual
> >> reproduction would be really appreciated.
> > 
> > I tried reproducing it but gave up quickly. We need a certain sequence
> > of events to happen:
> > 
> > CPU 1					CPU 2
> > kthread_use_mm()
> > 					/* user thread enables LAM */
> > 					context_switch()
> > context_switch() /* to user thread */
> 
> First, it would be fine to either create a new kthread for reproduction
> purposes or to hack an existing one.  For instance, have the LAM
> prctl() take an extra ref on the mm and stick it in a global variable:
> 
> 	mmgrab(current->mm);
> 	global_mm = current->mm;
> 
> Then in the kthread, grab the mm and use it:
> 
> 	while (!global_mm);
> 	kthread_use_mm(global_mm);
> 	... check for the race
> 	mmdrop(global_mm);
> 
> You can also hackily wait for thread to move with a stupid spin loop:
> 
> 	while (smp_processor_id() != 1);
> 
> and then actually move it with sched_setaffinity() from userspace.  That
> can make it easier to get that series of events to happen in lockstep.

I will take a stab at doing something similar and let you know, thanks.
Yosry Ahmed March 7, 2024, 10:30 p.m. UTC | #9
On Thu, Mar 07, 2024 at 01:48:28PM -0800, Dave Hansen wrote:
> On 3/7/24 13:08, Yosry Ahmed wrote:
> > CPU 1                                   CPU 2
> > /* user thread running */
> > context_switch() /* to kthread */
> >                                         /* user thread enables LAM */
> >                                         context_switch()
> > context_switch() /* to user thread */
> > 
> > In this case, there are no races, but the second context switch on CPU 1
> > may not write CR3 (if TLB is up-to-date), in which case we will run the
> > user thread with CR3 having the wrong LAM mask. This could cause bigger
> > problems, right?
> 
> Yes, but this is precisely the kind of thing that we should solve with
> mm generations.  Their sole purpose is to thwart the optimization where
> mm_prev == mm_next and we might not rewrite CR3.

I think it's clearer to have an explicit check for outdated LAM in
switch_mm_irqs_off(), as I mentioned in my other reply.
Dave Hansen March 7, 2024, 10:41 p.m. UTC | #10
On 3/7/24 14:29, Yosry Ahmed wrote:
> Just my 2c, let me know what you prefer :)
Yosry Ahmed March 7, 2024, 10:44 p.m. UTC | #11
On Thu, Mar 7, 2024 at 2:41 PM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 3/7/24 14:29, Yosry Ahmed wrote:
> > Just my 2c, let me know what you prefer :)
Yosry Ahmed March 8, 2024, 1:26 a.m. UTC | #12
On Thu, Mar 07, 2024 at 10:29:29PM +0000, Yosry Ahmed wrote:
> On Thu, Mar 07, 2024 at 01:39:53PM -0800, Dave Hansen wrote:
> > On 3/7/24 13:04, Yosry Ahmed wrote:
> > > I thought about doing inc_mm_tlb_gen() when LAM is updated, but it felt
> > > hacky and more importantly doesn't make it clear in switch_mm_irqs_off()
> > > that we correctly handle LAM updates. We can certainly add a comment,
> > > but I think an explicit check for CPU LAM vs. mm LAM is much clearer.
> > > 
> > > WDYT?
> > 
> > The mm generations are literally there so that if the mm changes,
> > all the CPUs know they need an update.  Changing LAM enabling is 100%
> > consistent with telling other CPUs that they need an update.
> > 
> > I'd be curious if Andy feels differently though.
> 
> The mm generations are TLB-specific and all the code using them assumes
> as much (e.g. look at the checks in switch_mm_irqs_off() when prev ==
> next). We can go around and update comments and/or function names to
> make them more generic, but this seems excessive. If we don't, the code
> becomes less clear imo.
> 
> I agree that the use case here is essentially the same (let other
> CPUs know they need to write CR3), but I still think that since the LAM
> case is just a simple one-time enablement, an explicit check in
> switch_mm_irqs_off() would be clearer.
> 
> Just my 2c, let me know what you prefer :)
> 
> > 
> > >> Considering how fun this code path is, a little effort at an actual
> > >> reproduction would be really appreciated.
> > > 
> > > I tried reproducing it but gave up quickly. We need a certain sequence
> > > of events to happen:
> > > 
> > > CPU 1					CPU 2
> > > kthread_use_mm()
> > > 					/* user thread enables LAM */
> > > 					context_switch()
> > > context_switch() /* to user thread */
> > 
> > First, it would be fine to either create a new kthread for reproduction
> > purposes or to hack an existing one.  For instance, have the LAM
> > prctl() take an extra ref on the mm and stick it in a global variable:
> > 
> > 	mmgrab(current->mm);
> > 	global_mm = current->mm;
> > 
> > Then in the kthread, grab the mm and use it:
> > 
> > 	while (!global_mm);
> > 	kthread_use_mm(global_mm);
> > 	... check for the race
> > 	mmdrop(global_mm);
> > 
> > You can also hackily wait for thread to move with a stupid spin loop:
> > 
> > 	while (smp_processor_id() != 1);
> > 
> > and then actually move it with sched_setaffinity() from userspace.  That
> > can make it easier to get that series of events to happen in lockstep.
> 
> I will take a stab at doing something similar and let you know, thanks.

I came up with a kernel patch that I *think* may reproduce the problem
with enough iterations. Userspace only needs to enable LAM, so I think
the selftest can be enough to trigger it.

However, there is no hardware with LAM at my disposal, and IIUC I cannot
use QEMU without KVM to run a kernel with LAM. I was planning to do more
testing before sending a non-RFC version, but apparently I cannot do
any testing beyond building at this point (including reproducing) :/

Let me know how you want to proceed. I can send a non-RFC v1 based on
the feedback I got on the RFC, but it will only be build tested.

For the record, here is the diff that I *think* may reproduce the bug:

diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index 33b268747bb7b..c37a8c26a3c21 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -750,8 +750,25 @@ static long prctl_map_vdso(const struct vdso_image *image, unsigned long addr)
 
 #define LAM_U57_BITS 6
 
+static int kthread_fn(void *_mm)
+{
+	struct mm_struct *mm = _mm;
+
+	/*
+	 * Wait for LAM to be enabled then schedule. Hopefully we will context
+	 * switch directly into the task that enabled LAM due to CPU pinning.
+	 */
+	kthread_use_mm(mm);
+	while (!test_bit(MM_CONTEXT_LOCK_LAM, &mm->context.flags));
+	schedule();
+	return 0;
+}
+
 static int prctl_enable_tagged_addr(struct mm_struct *mm, unsigned long nr_bits)
 {
+	struct task_struct *kthread_task;
+	int kthread_cpu;
+
 	if (!cpu_feature_enabled(X86_FEATURE_LAM))
 		return -ENODEV;
 
@@ -782,10 +799,22 @@ static int prctl_enable_tagged_addr(struct mm_struct *mm, unsigned long nr_bits)
 		return -EINVAL;
 	}
 
+	/* Pin the task to the current CPU */
+	set_cpus_allowed_ptr(current, cpumask_of(smp_processor_id()));
+
+	/* Run a kthread on another CPU and wait for it to start */
+	kthread_cpu = cpumask_next_wrap(smp_processor_id(), cpu_online_mask, 0, false),
+	kthread_task = kthread_run_on_cpu(kthread_fn, mm, kthread_cpu, "lam_repro_kthread");
+	while (!task_is_running(kthread_task));
+
 	write_cr3(__read_cr3() | mm->context.lam_cr3_mask);
 	set_tlbstate_lam_mode(mm);
 	set_bit(MM_CONTEXT_LOCK_LAM, &mm->context.flags);
 
+	/* Move the task to the kthread CPU */
+	set_cpus_allowed_ptr(current, cpumask_of(kthread_cpu));
+
 	mmap_write_unlock(mm);
 
 	return 0;
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 51f9f56941058..3afb53f1a1901 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -593,7 +593,7 @@ void switch_mm_irqs_off(struct mm_struct *unused, struct mm_struct *next,
 		next_tlb_gen = atomic64_read(&next->context.tlb_gen);
 		if (this_cpu_read(cpu_tlbstate.ctxs[prev_asid].tlb_gen) ==
 				next_tlb_gen)
-			return;
+			BUG_ON(new_lam != tlbstate_lam_cr3_mask());
 
 		/*
 		 * TLB contents went out of date while we were in lazy
Andy Lutomirski March 8, 2024, 1:34 a.m. UTC | #13
Catching up a bit...

On Thu, Mar 7, 2024, at 5:39 AM, Yosry Ahmed wrote:
> During context switching, if we are not switching to new mm and no TLB
> flush is needed, we do not write CR3. However, it is possible that a
> user thread enables LAM while a kthread is running on a different CPU
> with the old LAM CR3 mask. If the kthread context switches into any
> thread of that user process, it may not write CR3 with the new LAM mask,
> which would cause the user thread to run with a misconfigured CR3 that
> disables LAM on the CPU.

So I think (off the top of my head -- haven't thought about it all that hard) that LAM is logically like PCE and LDT: it's a property of an mm that is only rarely changed, and it doesn't really belong as part of the tlb_gen mechanism.  And, critically, it's not worth the effort and complexity to try to optimize LAM changes when we have a lazy CPU (just like PCE and LDT) (whereas TLB flushes are performance critical and are absolutely worth optimizing).

So...

>
> Fix this by making sure we write a new CR3 if LAM is not up-to-date. No
> problems were observed in practice, this was found by code inspection.

I think it should be fixed with a much bigger hammer: explicit IPIs.  Just don't ever let it get out of date, like install_ldt().

>
> Note that it is possible that mm->context.lam_cr3_mask changes throughout
> switch_mm_irqs_off(). But since LAM can only be enabled by a
> single-threaded process on its own behalf, in that case we cannot be
> switching to a user thread in that same process, we can only be
> switching to another kthread using the borrowed mm or a different user
> process, which should be fine.

The thought process is even simpler with the IPI: it *can* change while switching, but it will resynchronize immediately once IRQs turn back on.  And whoever changes it will *synchronize* with us, which would otherwise require extremely complex logic to get right.

And...

> -		if (!was_lazy)
> -			return;
> +		if (was_lazy) {
> +			/*
> +			 * Read the tlb_gen to check whether a flush is needed.
> +			 * If the TLB is up to date, just use it.  The barrier
> +			 * synchronizes with the tlb_gen increment in the TLB
> +			 * shootdown code.
> +			 */
> +			smp_mb();

This is actually rather expensive -- from old memory, we're talking maybe 20 cycles here, but this path is *very* hot and we try fairly hard to make it be fast.  If we get the happy PCID path, it's maybe 100-200 cycles, so this is like a 10% regression.  Ouch.

And you can delete all of this if you accept my suggestion.
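
For illustration, a rough sketch of the IPI approach along the lines of
install_ldt() (enable_lam_func() is a placeholder name and the placement
is an assumption, not the eventual patch):

	/*
	 * Runs via IPI on every CPU in mm_cpumask(mm); only touches CR3 on
	 * CPUs that currently have this mm loaded.
	 */
	static void enable_lam_func(void *__mm)
	{
		struct mm_struct *mm = __mm;

		if (this_cpu_read(cpu_tlbstate.loaded_mm) == mm) {
			write_cr3(__read_cr3() | mm->context.lam_cr3_mask);
			set_tlbstate_lam_mode(mm);
		}
	}

	/* In prctl_enable_tagged_addr(), after lam_cr3_mask is set: */
	on_each_cpu_mask(mm_cpumask(mm), enable_lam_func, mm, true);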
Yosry Ahmed March 8, 2024, 1:47 a.m. UTC | #14
On Thu, Mar 07, 2024 at 05:34:21PM -0800, Andy Lutomirski wrote:
> Catching up a bit...
> 
> On Thu, Mar 7, 2024, at 5:39 AM, Yosry Ahmed wrote:
> > During context switching, if we are not switching to new mm and no TLB
> > flush is needed, we do not write CR3. However, it is possible that a
> > user thread enables LAM while a kthread is running on a different CPU
> > with the old LAM CR3 mask. If the kthread context switches into any
> > thread of that user process, it may not write CR3 with the new LAM mask,
> > which would cause the user thread to run with a misconfigured CR3 that
> > disables LAM on the CPU.
> 
> So I think (off the top of my head -- haven't thought about it all that hard) that LAM is logically like PCE and LDT: it's a property of an mm that is only rarely changed, and it doesn't really belong as part of the tlb_gen mechanism.  And, critically, it's not worth the effort and complexity to try to optimize LAM changes when we have a lazy CPU (just like PCE and LDT) (whereas TLB flushes are performance critical and are absolutely worth optimizing).
> 
> So...
> 
> >
> > Fix this by making sure we write a new CR3 if LAM is not up-to-date. No
> > problems were observed in practice, this was found by code inspection.
> 
> I think it should be fixed with a much bigger hammer: explicit IPIs.  Just don't ever let it get out of date, like install_ldt().

I like this, and I think earlier versions of the code did this. I think
the code now assumes it's fine to not send an IPI since only
single-threaded processes can enable LAM, but this means we have to
handle kthreads switching to user threads with outdated LAMs (what this
patch is trying to do).

I also think there is currently an assumption that it's fine for
kthreads to run with an incorrect LAM, which is mostly fine, but the IPI
also drops that assumption.

> 
> >
> > Note that it is possible that mm->context.lam_cr3_mask changes throughout
> > switch_mm_irqs_off(). But since LAM can only be enabled by a
> > single-threaded process on its own behalf, in that case we cannot be
> > switching to a user thread in that same process, we can only be
> > switching to another kthread using the borrowed mm or a different user
> > process, which should be fine.
> 
> The thought process is even simpler with the IPI: it *can* change while switching, but it will resynchronize immediately once IRQs turn back on.  And whoever changes it will *synchronize* with us, which would otherwise require extremely complex logic to get right.
> 
> And...
> 
> > -		if (!was_lazy)
> > -			return;
> > +		if (was_lazy) {
> > +			/*
> > +			 * Read the tlb_gen to check whether a flush is needed.
> > +			 * If the TLB is up to date, just use it.  The barrier
> > +			 * synchronizes with the tlb_gen increment in the TLB
> > +			 * shootdown code.
> > +			 */
> > +			smp_mb();
> 
> This is actually rather expensive -- from old memory, we're talking maybe 20 cycles here, but this path is *very* hot and we try fairly hard to make it be fast.  If we get the happy PCID path, it's maybe 100-200 cycles, so this is like a 10% regression.  Ouch.

This is not newly introduced by this patch. I merely refactored this
code (reversed the if conditions). I think if we keep the current
approach I should move this refactoring to a separate patch to make
things clearer.

> 
> And you can delete all of this if you accept my suggestion.

I like it very much. The problem now is, as I told Dave, I realized I
cannot do any testing beyond compilation due to lack of hardware. I am
happy to send a next version if this is acceptable or if someone else
can test.
Yosry Ahmed March 8, 2024, 8:09 a.m. UTC | #15
> I came up with a kernel patch that I *think* may reproduce the problem
> with enough iterations. Userspace only needs to enable LAM, so I think
> the selftest can be enough to trigger it.
> 
> However, there is no hardware with LAM at my disposal, and IIUC I cannot
> use QEMU without KVM to run a kernel with LAM. I was planning to do more
> testing before sending a non-RFC version, but apparently I cannot do
> any testing beyond building at this point (including reproducing) :/
> 
> Let me know how you want to proceed. I can send a non-RFC v1 based on
> the feedback I got on the RFC, but it will only be build tested.
> 
> For the record, here is the diff that I *think* may reproduce the bug:

Okay, I was actually able to run _some_ testing with the diff below on
_a kernel_, and I hit the BUG_ON pretty quickly. If I did things
correctly, this BUG_ON means that even though we have an outdated LAM in
our CR3, we will not update CR3 because the TLB is up-to-date.

I can work on a v1 now with the IPI approach that Andy suggested. A
small kink is that we may still hit the BUG_ON with that fix, but in
that case it should be fine to not write CR3 because once we re-enable
interrupts we will receive the IPI and fix it. IOW, the diff below will
still BUG with the proposed fix, but it should be okay.

One thing I am not clear about with the IPI approach: if we use
mm_cpumask() to limit the IPI scope, we need to make sure that we read
mm_lam_cr3_mask() *after* we update the cpumask in switch_mm_irqs_off(),
which makes me think we'll need a barrier (and Andy said we want to
avoid those in this path). But looking at the code I see:

		/*
		 * Start remote flushes and then read tlb_gen.
		 */
		if (next != &init_mm)
			cpumask_set_cpu(cpu, mm_cpumask(next));
		next_tlb_gen = atomic64_read(&next->context.tlb_gen);

This code doesn't have a barrier. How do we make sure the read actually
happens after the write?

If no barrier is needed there, then I think we can similarly just read
the LAM mask after cpumask_set_cpu().
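
As a concrete sketch of that ordering (an assumption about how the IPI
variant could look, not existing code):

		/*
		 * Publish this CPU in mm_cpumask() before sampling the LAM
		 * mask, so a racing prctl that sets lam_cr3_mask and then
		 * IPIs mm_cpumask() either targets this CPU or we observe
		 * the new mask here.
		 */
		if (next != &init_mm)
			cpumask_set_cpu(cpu, mm_cpumask(next));
		/*
		 * On x86, cpumask_set_cpu() is a locked RMW and hence a full
		 * barrier, which is also why the existing tlb_gen read can
		 * get away without an explicit barrier; generically this
		 * would be smp_mb__after_atomic().
		 */
		new_lam = mm_lam_cr3_mask(next);
		next_tlb_gen = atomic64_read(&next->context.tlb_gen);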

> 
> diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
> index 33b268747bb7b..c37a8c26a3c21 100644
> --- a/arch/x86/kernel/process_64.c
> +++ b/arch/x86/kernel/process_64.c
> @@ -750,8 +750,25 @@ static long prctl_map_vdso(const struct vdso_image *image, unsigned long addr)
>  
>  #define LAM_U57_BITS 6
>  
> +static int kthread_fn(void *_mm)
> +{
> +	struct mm_struct *mm = _mm;
> +
> +	/*
> +	 * Wait for LAM to be enabled then schedule. Hopefully we will context
> +	 * switch directly into the task that enabled LAM due to CPU pinning.
> +	 */
> +	kthread_use_mm(mm);
> +	while (!test_bit(MM_CONTEXT_LOCK_LAM, &mm->context.flags));
> +	schedule();
> +	return 0;
> +}
> +
>  static int prctl_enable_tagged_addr(struct mm_struct *mm, unsigned long nr_bits)
>  {
> +	struct task_struct *kthread_task;
> +	int kthread_cpu;
> +
>  	if (!cpu_feature_enabled(X86_FEATURE_LAM))
>  		return -ENODEV;
>  
> @@ -782,10 +799,22 @@ static int prctl_enable_tagged_addr(struct mm_struct *mm, unsigned long nr_bits)
>  		return -EINVAL;
>  	}
>  
> +	/* Pin the task to the current CPU */
> +	set_cpus_allowed_ptr(current, cpumask_of(smp_processor_id()));
> +
> +	/* Run a kthread on another CPU and wait for it to start */
> +	kthread_cpu = cpumask_next_wrap(smp_processor_id(), cpu_online_mask, 0, false),
> +	kthread_task = kthread_run_on_cpu(kthread_fn, mm, kthread_cpu, "lam_repro_kthread");
> +	while (!task_is_running(kthread_task));
> +
>  	write_cr3(__read_cr3() | mm->context.lam_cr3_mask);
>  	set_tlbstate_lam_mode(mm);
>  	set_bit(MM_CONTEXT_LOCK_LAM, &mm->context.flags);
>  
> +	/* Move the task to the kthread CPU */
> +	set_cpus_allowed_ptr(current, cpumask_of(kthread_cpu));
> +
>  	mmap_write_unlock(mm);
>  
>  	return 0;
> diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
> index 51f9f56941058..3afb53f1a1901 100644
> --- a/arch/x86/mm/tlb.c
> +++ b/arch/x86/mm/tlb.c
> @@ -593,7 +593,7 @@ void switch_mm_irqs_off(struct mm_struct *unused, struct mm_struct *next,
>  		next_tlb_gen = atomic64_read(&next->context.tlb_gen);
>  		if (this_cpu_read(cpu_tlbstate.ctxs[prev_asid].tlb_gen) ==
>  				next_tlb_gen)
> -			return;
> +			BUG_ON(new_lam != tlbstate_lam_cr3_mask());
>  
>  		/*
>  		 * TLB contents went out of date while we were in lazy
>
Kirill A . Shutemov March 8, 2024, 2:05 p.m. UTC | #16
On Fri, Mar 08, 2024 at 01:47:39AM +0000, Yosry Ahmed wrote:
> I like it very much. The problem now is, as I told Dave, I realized I
> cannot do any testing beyond compilation due to lack of hardware. I am
> happy to send a next version if this is acceptable or if someone else
> can test.

I have non-upstreamable QEMU patch that adds LAM emulation, if it helps:

diff --git a/accel/tcg/cputlb.c b/accel/tcg/cputlb.c
index 93b1ca810bf4..fe887a86a156 100644
--- a/accel/tcg/cputlb.c
+++ b/accel/tcg/cputlb.c
@@ -1295,6 +1295,19 @@ void tlb_set_page(CPUState *cpu, vaddr addr,
                             prot, mmu_idx, size);
 }
 
+
+static vaddr clean_addr(CPUState *cpu, vaddr addr)
+{
+    CPUClass *cc = CPU_GET_CLASS(cpu);
+
+    if (cc->tcg_ops->do_clean_addr) {
+        addr = cc->tcg_ops->do_clean_addr(cpu, addr);
+    }
+
+    return addr;
+}
+
+
 /*
  * Note: tlb_fill() can trigger a resize of the TLB. This means that all of the
  * caller's prior references to the TLB table (e.g. CPUTLBEntry pointers) must
@@ -1867,9 +1880,10 @@ static bool mmu_lookup(CPUState *cpu, vaddr addr, MemOpIdx oi,
  * Probe for an atomic operation.  Do not allow unaligned operations,
  * or io operations to proceed.  Return the host address.
  */
-static void *atomic_mmu_lookup(CPUState *cpu, vaddr addr, MemOpIdx oi,
+static void *atomic_mmu_lookup(CPUState *cpu, vaddr address, MemOpIdx oi,
                                int size, uintptr_t retaddr)
 {
+    vaddr addr = clean_addr(cpu, address);
     uintptr_t mmu_idx = get_mmuidx(oi);
     MemOp mop = get_memop(oi);
     int a_bits = get_alignment_bits(mop);
@@ -2002,10 +2016,11 @@ static void *atomic_mmu_lookup(CPUState *cpu, vaddr addr, MemOpIdx oi,
  * The bytes are concatenated in big-endian order with @ret_be.
  */
 static uint64_t int_ld_mmio_beN(CPUState *cpu, CPUTLBEntryFull *full,
-                                uint64_t ret_be, vaddr addr, int size,
+                                uint64_t ret_be, vaddr address, int size,
                                 int mmu_idx, MMUAccessType type, uintptr_t ra,
                                 MemoryRegion *mr, hwaddr mr_offset)
 {
+    vaddr addr = clean_addr(cpu, address);
     do {
         MemOp this_mop;
         unsigned this_size;
@@ -2543,10 +2558,11 @@ static Int128 do_ld16_mmu(CPUState *cpu, vaddr addr,
  * return the bytes of @val_le beyond @p->size that have not been stored.
  */
 static uint64_t int_st_mmio_leN(CPUState *cpu, CPUTLBEntryFull *full,
-                                uint64_t val_le, vaddr addr, int size,
+                                uint64_t val_le, vaddr address, int size,
                                 int mmu_idx, uintptr_t ra,
                                 MemoryRegion *mr, hwaddr mr_offset)
 {
+    vaddr addr = clean_addr(cpu, address);
     do {
         MemOp this_mop;
         unsigned this_size;
diff --git a/include/hw/core/tcg-cpu-ops.h b/include/hw/core/tcg-cpu-ops.h
index bf8ff8e3eec1..eaa8e09a6226 100644
--- a/include/hw/core/tcg-cpu-ops.h
+++ b/include/hw/core/tcg-cpu-ops.h
@@ -140,6 +140,12 @@ struct TCGCPUOps {
                                            MMUAccessType access_type,
                                            int mmu_idx, uintptr_t retaddr);
 
+
+    /**
+     * @do_clean_addr: Callback for clearing metadata/tags from the address.
+     */
+    vaddr (*do_clean_addr)(CPUState *cpu, vaddr addr);
+
     /**
      * @adjust_watchpoint_address: hack for cpu_check_watchpoint used by ARM
      */
diff --git a/target/i386/cpu.c b/target/i386/cpu.c
index 2666ef380891..1bbfd31042b2 100644
--- a/target/i386/cpu.c
+++ b/target/i386/cpu.c
@@ -739,7 +739,7 @@ void x86_cpu_vendor_words2str(char *dst, uint32_t vendor1,
 #define TCG_7_0_EDX_FEATURES (CPUID_7_0_EDX_FSRM | CPUID_7_0_EDX_KERNEL_FEATURES)
 
 #define TCG_7_1_EAX_FEATURES (CPUID_7_1_EAX_FZRM | CPUID_7_1_EAX_FSRS | \
-          CPUID_7_1_EAX_FSRC | CPUID_7_1_EAX_CMPCCXADD)
+          CPUID_7_1_EAX_FSRC | CPUID_7_1_EAX_CMPCCXADD | CPUID_7_1_EAX_LAM)
 #define TCG_7_1_EDX_FEATURES 0
 #define TCG_7_2_EDX_FEATURES 0
 #define TCG_APM_FEATURES 0
@@ -968,7 +968,7 @@ FeatureWordInfo feature_word_info[FEATURE_WORDS] = {
             "fsrc", NULL, NULL, NULL,
             NULL, NULL, NULL, NULL,
             NULL, "amx-fp16", NULL, "avx-ifma",
-            NULL, NULL, NULL, NULL,
+            NULL, NULL, "lam", NULL,
             NULL, NULL, NULL, NULL,
         },
         .cpuid = {
diff --git a/target/i386/cpu.h b/target/i386/cpu.h
index 952174bb6f52..6ef9afd443b7 100644
--- a/target/i386/cpu.h
+++ b/target/i386/cpu.h
@@ -238,6 +238,9 @@ typedef enum X86Seg {
 #define CR0_CD_MASK  (1U << 30)
 #define CR0_PG_MASK  (1U << 31)
 
+#define CR3_LAM_U57  (1ULL << 61)
+#define CR3_LAM_U48  (1ULL << 62)
+
 #define CR4_VME_MASK  (1U << 0)
 #define CR4_PVI_MASK  (1U << 1)
 #define CR4_TSD_MASK  (1U << 2)
@@ -261,6 +264,7 @@ typedef enum X86Seg {
 #define CR4_SMAP_MASK   (1U << 21)
 #define CR4_PKE_MASK   (1U << 22)
 #define CR4_PKS_MASK   (1U << 24)
+#define CR4_LAM_SUP    (1U << 28)
 
 #define CR4_RESERVED_MASK \
 (~(target_ulong)(CR4_VME_MASK | CR4_PVI_MASK | CR4_TSD_MASK \
@@ -269,7 +273,8 @@ typedef enum X86Seg {
                 | CR4_OSFXSR_MASK | CR4_OSXMMEXCPT_MASK | CR4_UMIP_MASK \
                 | CR4_LA57_MASK \
                 | CR4_FSGSBASE_MASK | CR4_PCIDE_MASK | CR4_OSXSAVE_MASK \
-                | CR4_SMEP_MASK | CR4_SMAP_MASK | CR4_PKE_MASK | CR4_PKS_MASK))
+                | CR4_SMEP_MASK | CR4_SMAP_MASK | CR4_PKE_MASK | CR4_PKS_MASK \
+                | CR4_LAM_SUP ))
 
 #define DR6_BD          (1 << 13)
 #define DR6_BS          (1 << 14)
@@ -932,6 +937,8 @@ uint64_t x86_cpu_get_supported_feature_word(FeatureWord w,
 #define CPUID_7_1_EAX_AMX_FP16          (1U << 21)
 /* Support for VPMADD52[H,L]UQ */
 #define CPUID_7_1_EAX_AVX_IFMA          (1U << 23)
+/* Linear Address Masking */
+#define CPUID_7_1_EAX_LAM               (1U << 26)
 
 /* Support for VPDPB[SU,UU,SS]D[,S] */
 #define CPUID_7_1_EDX_AVX_VNNI_INT8     (1U << 4)
@@ -2525,6 +2532,24 @@ static inline bool hyperv_feat_enabled(X86CPU *cpu, int feat)
     return !!(cpu->hyperv_features & BIT(feat));
 }
 
+static inline uint64_t cr3_reserved_bits(CPUX86State *env)
+{
+    uint64_t reserved_bits;
+
+    if (!(env->efer & MSR_EFER_LMA)) {
+        return 0;
+    }
+
+    reserved_bits = (~0ULL) << env_archcpu(env)->phys_bits;
+
+    if (env->features[FEAT_7_1_EAX] & CPUID_7_1_EAX_LAM) {
+        reserved_bits &= ~(CR3_LAM_U48 | CR3_LAM_U57);
+    }
+
+    return reserved_bits;
+}
+
+
 static inline uint64_t cr4_reserved_bits(CPUX86State *env)
 {
     uint64_t reserved_bits = CR4_RESERVED_MASK;
diff --git a/target/i386/helper.c b/target/i386/helper.c
index 2070dd0dda1f..4901c9c17b1e 100644
--- a/target/i386/helper.c
+++ b/target/i386/helper.c
@@ -262,7 +262,7 @@ hwaddr x86_cpu_get_phys_page_attrs_debug(CPUState *cs, vaddr addr,
             }
 
             if (la57) {
-                pml5e_addr = ((env->cr[3] & ~0xfff) +
+                pml5e_addr = ((env->cr[3] & PG_ADDRESS_MASK) +
                         (((addr >> 48) & 0x1ff) << 3)) & a20_mask;
                 pml5e = x86_ldq_phys(cs, pml5e_addr);
                 if (!(pml5e & PG_PRESENT_MASK)) {
diff --git a/target/i386/tcg/helper-tcg.h b/target/i386/tcg/helper-tcg.h
index effc2c1c9842..11f75ea475e3 100644
--- a/target/i386/tcg/helper-tcg.h
+++ b/target/i386/tcg/helper-tcg.h
@@ -84,6 +84,7 @@ bool x86_cpu_tlb_fill(CPUState *cs, vaddr address, int size,
 G_NORETURN void x86_cpu_do_unaligned_access(CPUState *cs, vaddr vaddr,
                                             MMUAccessType access_type,
                                             int mmu_idx, uintptr_t retaddr);
+vaddr x86_cpu_clean_addr(CPUState *cpu, vaddr addr);
 #endif
 
 /* cc_helper.c */
diff --git a/target/i386/tcg/sysemu/excp_helper.c b/target/i386/tcg/sysemu/excp_helper.c
index 8f7011d96631..1bc71170e6a3 100644
--- a/target/i386/tcg/sysemu/excp_helper.c
+++ b/target/i386/tcg/sysemu/excp_helper.c
@@ -163,7 +163,7 @@ static bool mmu_translate(CPUX86State *env, const TranslateParams *in,
                 /*
                  * Page table level 5
                  */
-                pte_addr = (in->cr3 & ~0xfff) + (((addr >> 48) & 0x1ff) << 3);
+                pte_addr = (in->cr3 & PG_ADDRESS_MASK) + (((addr >> 48) & 0x1ff) << 3);
                 if (!ptw_translate(&pte_trans, pte_addr)) {
                     return false;
                 }
@@ -638,3 +638,30 @@ G_NORETURN void x86_cpu_do_unaligned_access(CPUState *cs, vaddr vaddr,
     X86CPU *cpu = X86_CPU(cs);
     handle_unaligned_access(&cpu->env, vaddr, access_type, retaddr);
 }
+
+
+static inline int64_t sign_extend64(uint64_t value, int index)
+{
+    int shift = 63 - index;
+    return (int64_t)(value << shift) >> shift;
+}
+
+vaddr x86_cpu_clean_addr(CPUState *cs, vaddr addr)
+{
+    CPUX86State *env = &X86_CPU(cs)->env;
+    bool la57 = env->cr[4] & CR4_LA57_MASK;
+
+    if (addr >> 63) {
+        if (env->cr[4] & CR4_LAM_SUP) {
+            return sign_extend64(addr, la57 ? 56 : 47);
+        }
+    } else {
+        if (env->cr[3] & CR3_LAM_U57) {
+            return sign_extend64(addr, 56);
+        } else if (env->cr[3] & CR3_LAM_U48) {
+            return sign_extend64(addr, 47);
+        }
+    }
+
+    return addr;
+}
diff --git a/target/i386/tcg/sysemu/misc_helper.c b/target/i386/tcg/sysemu/misc_helper.c
index edb7c3d89408..aecb523e777d 100644
--- a/target/i386/tcg/sysemu/misc_helper.c
+++ b/target/i386/tcg/sysemu/misc_helper.c
@@ -98,8 +98,7 @@ void helper_write_crN(CPUX86State *env, int reg, target_ulong t0)
         cpu_x86_update_cr0(env, t0);
         break;
     case 3:
-        if ((env->efer & MSR_EFER_LMA) &&
-                (t0 & ((~0ULL) << env_archcpu(env)->phys_bits))) {
+        if (t0 & cr3_reserved_bits(env)) {
             cpu_vmexit(env, SVM_EXIT_ERR, 0, GETPC());
         }
         if (!(env->efer & MSR_EFER_LMA)) {
diff --git a/target/i386/tcg/sysemu/svm_helper.c b/target/i386/tcg/sysemu/svm_helper.c
index 5d6de2294fa1..e981b124d975 100644
--- a/target/i386/tcg/sysemu/svm_helper.c
+++ b/target/i386/tcg/sysemu/svm_helper.c
@@ -305,8 +305,7 @@ void helper_vmrun(CPUX86State *env, int aflag, int next_eip_addend)
         cpu_vmexit(env, SVM_EXIT_ERR, 0, GETPC());
     }
     new_cr3 = x86_ldq_phys(cs, env->vm_vmcb + offsetof(struct vmcb, save.cr3));
-    if ((env->efer & MSR_EFER_LMA) &&
-            (new_cr3 & ((~0ULL) << cpu->phys_bits))) {
+    if (new_cr3 & cr3_reserved_bits(env)) {
         cpu_vmexit(env, SVM_EXIT_ERR, 0, GETPC());
     }
     new_cr4 = x86_ldq_phys(cs, env->vm_vmcb + offsetof(struct vmcb, save.cr4));
diff --git a/target/i386/tcg/tcg-cpu.c b/target/i386/tcg/tcg-cpu.c
index cca19cd40e81..8ceeb954364e 100644
--- a/target/i386/tcg/tcg-cpu.c
+++ b/target/i386/tcg/tcg-cpu.c
@@ -118,6 +118,7 @@ static const TCGCPUOps x86_tcg_ops = {
     .record_sigbus = x86_cpu_record_sigbus,
 #else
     .tlb_fill = x86_cpu_tlb_fill,
+    .do_clean_addr = x86_cpu_clean_addr,
     .do_interrupt = x86_cpu_do_interrupt,
     .cpu_exec_halt = x86_cpu_exec_halt,
     .cpu_exec_interrupt = x86_cpu_exec_interrupt,
Dave Hansen March 8, 2024, 3:23 p.m. UTC | #17
On 3/7/24 17:34, Andy Lutomirski wrote:
>> Fix this by making sure we write a new CR3 if LAM is not
>> up-to-date. No problems were observed in practice, this was found
>> by code inspection.
> I think it should be fixed with a much bigger hammer: explicit IPIs.
> Just don't ever let it get out of date, like install_ldt().
I guess it matters whether the thing that matters is having a persistent
inconsistency or a temporary one.  IPIs will definitely turn a permanent
one into a temporary one.

But this is all easier to reason about if we can get rid of even the
temporary inconsistency.

Wouldn't this be even simpler than IPIs?

static inline unsigned long set_tlbstate_lam_mode(struct mm_struct *mm)
{
	unsigned long lam = READ_ONCE(mm->context.lam_cr3_mask);

+	/* LAM is for userspace only.  Ignore it for kernel threads: */
+	if (tsk->flags & PF_KTHREAD)
+		return 0;

	this_cpu_write(cpu_tlbstate.lam, lam >> X86_CR3_LAM_U57_BIT);
 	this_cpu_write(tlbstate_untag_mask, mm->context.untag_mask);
	return lam;
}
Kirill A . Shutemov March 8, 2024, 6:18 p.m. UTC | #18
On Fri, Mar 08, 2024 at 07:23:58AM -0800, Dave Hansen wrote:
> On 3/7/24 17:34, Andy Lutomirski wrote:
> >> Fix this by making sure we write a new CR3 if LAM is not
> >> up-to-date. No problems were observed in practice, this was found
> >> by code inspection.
> > I think it should be fixed with a much bigger hammer: explicit IPIs.
> > Just don't ever let it get out of date, like install_ldt().
> I guess it matters whether the thing that matters is having a persistent
> inconsistency or a temporary one.  IPIs will definitely turn a permanent
> one into a temporary one.
> 
> But this is all easier to reason about if we can get rid of even the
> temporary inconsistency.
> 
> Wouldn't this be even simpler than IPIs?
> 
> static inline unsigned long set_tlbstate_lam_mode(struct mm_struct *mm)
> {
> 	unsigned long lam = READ_ONCE(mm->context.lam_cr3_mask);
> 
> +	/* LAM is for userspace only.  Ignore it for kernel threads: */
> +	if (tsk->flags & PF_KTHREAD)
> +		return 0;

I like this approach. kthread_use_mm() WARNs if it is called for a
non-PF_KTHREAD task, so it should be okay.

I was worried that it would also exclude io_uring, but worker threads
don't have the flag set.
Yosry Ahmed March 9, 2024, 2:19 a.m. UTC | #19
On Fri, Mar 08, 2024 at 07:23:58AM -0800, Dave Hansen wrote:
> On 3/7/24 17:34, Andy Lutomirski wrote:
> >> Fix this by making sure we write a new CR3 if LAM is not
> >> up-to-date. No problems were observed in practice, this was found
> >> by code inspection.
> > I think it should be fixed with a much bigger hammer: explicit IPIs.
> > Just don't ever let it get out of date, like install_ldt().
> I guess it matters whether the thing that matters is having a persistent
> inconsistency or a temporary one.  IPIs will definitely turn a permanent
> one into a temporary one.
> 
> But this is all easier to reason about if we can get rid of even the
> temporary inconsistency.
> 
> Wouldn't this be even simpler than IPIs?
> 
> static inline unsigned long set_tlbstate_lam_mode(struct mm_struct *mm)
> {
> 	unsigned long lam = READ_ONCE(mm->context.lam_cr3_mask);
> 
> +	/* LAM is for userspace only.  Ignore it for kernel threads: */
> +	if (tsk->flags & PF_KTHREAD)
> +		return 0;
> 
> 	this_cpu_write(cpu_tlbstate.lam, lam >> X86_CR3_LAM_U57_BIT);
>  	this_cpu_write(tlbstate_untag_mask, mm->context.untag_mask);
> 	return lam;
> }

Hmm I don't see how this fixes the problem addressed by this patch. I
think this fixes the problem addressed by patch 1, where CR3 and
cpu_tlbstate.lam may get out of sync if LAM enablement races with
switch_mm_irqs_off().

However, this patch is fixing a deeper problem (an actual bug).
Precisely this situation:

CPU 1					CPU 2
/* kthread */
kthread_use_mm()
					/* user thread */
					prctl_enable_tagged_addr()
					/* LAM enabled */
					context_switch() /* to CPU 1 */
switch_mm_irqs_off()
/* user thread */
---> LAM is disabled here <---


When switch_mm_irqs_off() runs on CPU 1 to switch from the kthread to
the user thread, because the mm is not actually changing, we may not
write CR3. In this case, the user thread runs on CPU 1 with LAM
disabled, right?

The IPI would fix this problem because prctl_enable_tagged_addr() will
make sure that CPU 1 enables LAM before it returns to userspace.
Alternatively, this patch fixes the problem by making sure we write CR3
in switch_mm_irqs_off() if LAM is out-of-date.

I don't see how skipping set_tlbstate_lam_mode() for kthreads fixes this
problem. Do you mind elaborating?
Kirill A . Shutemov March 9, 2024, 4:34 p.m. UTC | #20
On Sat, Mar 09, 2024 at 02:19:19AM +0000, Yosry Ahmed wrote:
> I don't see how skipping set_tlbstate_lam_mode() for kthreads fixes this
> problem. Do you mind elaborating?

Define what problem is.

Yes, in this scenario kthread gets more permissive LAM mode than it needs.
But nothing breaks.
Yosry Ahmed March 9, 2024, 9:37 p.m. UTC | #21
On Sat, Mar 9, 2024 at 8:34 AM Kirill A. Shutemov
<kirill.shutemov@linux.intel.com> wrote:
>
> On Sat, Mar 09, 2024 at 02:19:19AM +0000, Yosry Ahmed wrote:
> > I don't see how skipping set_tlbstate_lam_mode() for kthreads fixes this
> > problem. Do you mind elaborating?
>
> Define what problem is.
>
> Yes, in this scenario kthread gets more permissive LAM mode than it needs.
> But nothing breaks.


The problem here is not how the kthread runs at all. It is the fact
that if that kthread context switches into the user process that has
enabled LAM, it may not update CR3 because the mm doesn't change.
switch_mm_irqs_off() will only update CR3 in this case if there is a
pending TLB flush. Otherwise, we just return, even if the LAM for this
mm has changed.

This can cause the process that has enabled LAM to run with LAM
disabled and fault on tagged addresses, right? Did I miss something?
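
To make that failure mode concrete, a hedged userspace sketch (it assumes
the ARCH_ENABLE_TAGGED_ADDR arch_prctl from the LAM series; with LAM_U57
the hardware ignores bits 62:57 of user pointers):

	#include <stdint.h>
	#include <stdio.h>
	#include <stdlib.h>
	#include <sys/syscall.h>
	#include <unistd.h>
	#include <asm/prctl.h>

	int main(void)
	{
		/* Enable LAM_U57: up to 6 tag bits in bits 62:57. */
		if (syscall(SYS_arch_prctl, ARCH_ENABLE_TAGGED_ADDR, 6))
			return 1;

		uint64_t *p = malloc(sizeof(*p));
		/* Same object, with a non-zero tag in bits 62:57. */
		uint64_t *tagged = (uint64_t *)((uintptr_t)p | (0x2fULL << 57));

		/* Faults (non-canonical access) if CR3 lost the LAM bits. */
		*tagged = 42;
		printf("%llu\n", (unsigned long long)*tagged);
		free(p);
		return 0;
	}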
Kirill A . Shutemov March 11, 2024, 12:42 p.m. UTC | #22
On Sat, Mar 09, 2024 at 01:37:06PM -0800, Yosry Ahmed wrote:
> On Sat, Mar 9, 2024 at 8:34 AM Kirill A. Shutemov
> <kirill.shutemov@linux.intel.com> wrote:
> >
> > On Sat, Mar 09, 2024 at 02:19:19AM +0000, Yosry Ahmed wrote:
> > > I don't see how skipping set_tlbstate_lam_mode() for kthreads fixes this
> > > problem. Do you mind elaborating?
> >
> > Define what problem is.
> >
> > Yes, in this scenario kthread gets more permissive LAM mode than it needs.
> > But nothing breaks.
> 
> 
> The problem here is not how the kthread runs at all. It is the fact
> that if that kthread context switches into the user process that has
> enabled LAM, it may not update CR3 because the mm doesn't change.
> switch_mm_irqs_off() will only update CR3 in this case if there is a
> pending TLB flush. Otherwise, we just return, even if the LAM for this
> mm has changed.
> 
> This can cause the process that has enabled LAM to run with LAM
> disabled and fault on tagged addresses, right? Did I miss something?

You are right. I think IPI is the way to go.

Will you prepare a patch?
Yosry Ahmed March 11, 2024, 6:27 p.m. UTC | #23
On Mon, Mar 11, 2024 at 5:42 AM Kirill A. Shutemov
<kirill.shutemov@linux.intel.com> wrote:
>
> On Sat, Mar 09, 2024 at 01:37:06PM -0800, Yosry Ahmed wrote:
> > On Sat, Mar 9, 2024 at 8:34 AM Kirill A. Shutemov
> > <kirill.shutemov@linux.intel.com> wrote:
> > >
> > > On Sat, Mar 09, 2024 at 02:19:19AM +0000, Yosry Ahmed wrote:
> > > > I don't see how skipping set_tlbstate_lam_mode() for kthreads fixes this
> > > > problem. Do you mind elaborating?
> > >
> > > Define what problem is.
> > >
> > > Yes, in this scenario kthread gets more permissive LAM mode than it needs.
> > > But nothing breaks.
> >
> >
> > The problem here is not how the kthread runs at all. It is the fact
> > that if that kthread context switches into the user process that has
> > enabled LAM, it may not update CR3 because the mm doesn't change.
> > switch_mm_irqs_off() will only update CR3 in this case if there is a
> > pending TLB flush. Otherwise, we just return, even if the LAM for this
> > mm has changed.
> >
> > This can cause the process that has enabled LAM to run with LAM
> > disabled and fault on tagged addresses, right? Did I miss something?
>
> You are right. I think IPI is the way to go.
>
> Will you prepare a patch?

I am working on it.

Patch

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 2975d3f89a5de..3610c23499085 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -503,11 +503,12 @@  void switch_mm_irqs_off(struct mm_struct *unused, struct mm_struct *next,
 {
 	struct mm_struct *prev = this_cpu_read(cpu_tlbstate.loaded_mm);
 	u16 prev_asid = this_cpu_read(cpu_tlbstate.loaded_mm_asid);
+	u64 cpu_tlb_gen = this_cpu_read(cpu_tlbstate.ctxs[prev_asid].tlb_gen);
 	bool was_lazy = this_cpu_read(cpu_tlbstate_shared.is_lazy);
+	bool need_flush = false, need_lam_update = false;
 	unsigned cpu = smp_processor_id();
 	unsigned long new_lam;
 	u64 next_tlb_gen;
-	bool need_flush;
 	u16 new_asid;
 
 	/* We don't want flush_tlb_func() to run concurrently with us. */
@@ -570,32 +571,41 @@  void switch_mm_irqs_off(struct mm_struct *unused, struct mm_struct *next,
 				 !cpumask_test_cpu(cpu, mm_cpumask(next))))
 			cpumask_set_cpu(cpu, mm_cpumask(next));
 
+		/*
+		 * tlbstate_lam_cr3_mask() may be outdated if a different thread
+		 * has enabled LAM while we were borrowing its mm on this CPU.
+		 * Make sure we update CR3 in case we are switching to another
+		 * thread in that process.
+		 */
+		if (tlbstate_lam_cr3_mask() != mm_lam_cr3_mask(next))
+			need_lam_update = true;
+
 		/*
 		 * If the CPU is not in lazy TLB mode, we are just switching
 		 * from one thread in a process to another thread in the same
 		 * process. No TLB flush required.
 		 */
-		if (!was_lazy)
-			return;
+		if (was_lazy) {
+			/*
+			 * Read the tlb_gen to check whether a flush is needed.
+			 * If the TLB is up to date, just use it.  The barrier
+			 * synchronizes with the tlb_gen increment in the TLB
+			 * shootdown code.
+			 */
+			smp_mb();
+			next_tlb_gen = atomic64_read(&next->context.tlb_gen);
+			if (cpu_tlb_gen < next_tlb_gen) {
+				/*
+				 * TLB contents went out of date while we were
+				 * in lazy mode.
+				 */
+				new_asid = prev_asid;
+				need_flush = true;
+			}
+		}
 
-		/*
-		 * Read the tlb_gen to check whether a flush is needed.
-		 * If the TLB is up to date, just use it.
-		 * The barrier synchronizes with the tlb_gen increment in
-		 * the TLB shootdown code.
-		 */
-		smp_mb();
-		next_tlb_gen = atomic64_read(&next->context.tlb_gen);
-		if (this_cpu_read(cpu_tlbstate.ctxs[prev_asid].tlb_gen) ==
-				next_tlb_gen)
+		if (!need_flush && !need_lam_update)
 			return;
-
-		/*
-		 * TLB contents went out of date while we were in lazy
-		 * mode. Fall through to the TLB switching code below.
-		 */
-		new_asid = prev_asid;
-		need_flush = true;
 	} else {
 		/*
 		 * Apply process to process speculation vulnerability