
[PATCHv3,6/8] x86/mm: Provide ARCH_GET_UNTAG_MASK and ARCH_ENABLE_TAGGED_ADDR

Message ID 20220610143527.22974-7-kirill.shutemov@linux.intel.com (mailing list archive)
State New
Series: Linear Address Masking enabling

Commit Message

kirill.shutemov@linux.intel.com June 10, 2022, 2:35 p.m. UTC
Add a couple of arch_prctl() handlers:

 - ARCH_ENABLE_TAGGED_ADDR enables LAM. The argument is the required
   number of tag bits. It is rounded up to the nearest LAM mode that can
   provide it. For now, only LAM_U57 is supported, with 6 tag bits.

 - ARCH_GET_UNTAG_MASK returns the untag mask. It indicates where the
   tag bits are located in the address.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/include/uapi/asm/prctl.h |  3 +++
 arch/x86/kernel/process_64.c      | 32 ++++++++++++++++++++++++++++++-
 2 files changed, 34 insertions(+), 1 deletion(-)
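
For reference, a minimal userspace sketch of how the proposed interface
would be used (hedged: glibc provides no arch_prctl() wrapper, so the raw
syscall is used; the option values are taken from the uapi hunk in the
patch below; error handling and CPU feature detection are omitted):

#include <stdio.h>
#include <stdint.h>
#include <unistd.h>
#include <sys/syscall.h>

#define ARCH_GET_UNTAG_MASK	0x4001
#define ARCH_ENABLE_TAGGED_ADDR	0x4002

int main(void)
{
	unsigned long mask;

	/* Ask for 6 tag bits; the kernel rounds up to LAM_U57. */
	if (syscall(SYS_arch_prctl, ARCH_ENABLE_TAGGED_ADDR, 6UL))
		return 1;

	/* The untag mask has zero bits where the tag lives (bits 62:57). */
	if (syscall(SYS_arch_prctl, ARCH_GET_UNTAG_MASK, &mask))
		return 1;
	printf("untag mask: %#lx\n", mask);

	/* Software untagging: clear the tag bits from a tagged pointer. */
	uintptr_t ptr      = 0x7f1234567000UL;
	uintptr_t tagged   = ptr | (0x2aUL << 57);
	uintptr_t untagged = tagged & mask;
	return untagged != ptr;
}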

Comments

Edgecombe, Rick P June 10, 2022, 3:25 p.m. UTC | #1
On Fri, 2022-06-10 at 17:35 +0300, Kirill A. Shutemov wrote:
> +static int prctl_enable_tagged_addr(unsigned long nr_bits)
> +{
> +       struct mm_struct *mm = current->mm;

do_arch_prctl_64() can be called via ptrace. I think you need to
operate on the mm of 'task', or just block the operation if task !=
current.

> +
> +       /* Already enabled? */
> +       if (mm->context.lam_cr3_mask)
> +               return -EBUSY;
> +
> +       /* LAM has to be enabled before spawning threads */
> +       if (get_nr_threads(current) > 1)
> +               return -EBUSY;
> +
> +       if (!nr_bits) {
> +               return -EINVAL;
> +       } else if (nr_bits <= 6) {
> +               mm->context.lam_cr3_mask = X86_CR3_LAM_U57;
> +               mm->context.untag_mask =  ~GENMASK(62, 57);
> +       } else {
> +               return -EINVAL;
> +       }
> +
> +       /* Update CR3 to get LAM active */
> +       switch_mm(current->mm, current->mm, current);
> +       return 0;
> +}
> +
>  long do_arch_prctl_64(struct task_struct *task, int option, unsigned
> long arg2)
>  {
>         int ret = 0;
> @@ -829,7 +855,11 @@ long do_arch_prctl_64(struct task_struct *task,
> int option, unsigned long arg2)
>         case ARCH_MAP_VDSO_64:
>                 return prctl_map_vdso(&vdso_image_64, arg2);
>  #endif
> -
> +       case ARCH_GET_UNTAG_MASK:
> +               return put_user(current->mm->context.untag_mask,
> +                               (unsigned long __user *)arg2);
> +       case ARCH_ENABLE_TAGGED_ADDR:
> +               return prctl_enable_tagged_addr(arg2);
>         default:
>                 ret = -EINVAL;
>                 break;
Edgecombe, Rick P June 10, 2022, 4:16 p.m. UTC | #2
On Fri, 2022-06-10 at 17:35 +0300, Kirill A. Shutemov wrote:
> +static int prctl_enable_tagged_addr(unsigned long nr_bits)
> +{
> +       struct mm_struct *mm = current->mm;
> +
> +       /* Already enabled? */
> +       if (mm->context.lam_cr3_mask)
> +               return -EBUSY;
> +
> +       /* LAM has to be enabled before spawning threads */
> +       if (get_nr_threads(current) > 1)
> +               return -EBUSY;

Does this work for vfork()? I guess the idea is that locking is not
needed below because there is only one thread with the MM, but with
vfork() another task could operate on the MM, call fork(), etc. I'm not
sure...

> +
> +       if (!nr_bits) {
> +               return -EINVAL;
> +       } else if (nr_bits <= 6) {
> +               mm->context.lam_cr3_mask = X86_CR3_LAM_U57;
> +               mm->context.untag_mask =  ~GENMASK(62, 57);
> +       } else {
> +               return -EINVAL;
> +       }
> +
> +       /* Update CR3 to get LAM active */
> +       switch_mm(current->mm, current->mm, current);
> +       return 0;
> +}
kirill.shutemov@linux.intel.com June 10, 2022, 6:04 p.m. UTC | #3
On Fri, Jun 10, 2022 at 03:25:02PM +0000, Edgecombe, Rick P wrote:
> On Fri, 2022-06-10 at 17:35 +0300, Kirill A. Shutemov wrote:
> > +static int prctl_enable_tagged_addr(unsigned long nr_bits)
> > +{
> > +       struct mm_struct *mm = current->mm;
> 
> do_arch_prctl_64() can be called via ptrace. I think you need to
> operate on the mm of 'task', or just block the operation if task !=
> current.

Hm. True. Let's see if the interface in general is good enough.

Ouch. I just noticed that I missed the check for the LAM feature :/
kirill.shutemov@linux.intel.com June 10, 2022, 6:06 p.m. UTC | #4
On Fri, Jun 10, 2022 at 04:16:01PM +0000, Edgecombe, Rick P wrote:
> On Fri, 2022-06-10 at 17:35 +0300, Kirill A. Shutemov wrote:
> > +static int prctl_enable_tagged_addr(unsigned long nr_bits)
> > +{
> > +       struct mm_struct *mm = current->mm;
> > +
> > +       /* Already enabled? */
> > +       if (mm->context.lam_cr3_mask)
> > +               return -EBUSY;
> > +
> > +       /* LAM has to be enabled before spawning threads */
> > +       if (get_nr_threads(current) > 1)
> > +               return -EBUSY;
> 
> Does this work for vfork()? I guess the idea is that locking is not
> needed below because there is only one thread with the MM, but with
> vfork() another task could operate on the MM, call fork(), etc. I'm not
> sure...

I'm not sure I follow. vfork() blocks parent process until child exit or
execve(). I don't see how it is a problem.
Edgecombe, Rick P June 10, 2022, 6:08 p.m. UTC | #5
On Fri, 2022-06-10 at 21:06 +0300, Kirill A. Shutemov wrote:
> On Fri, Jun 10, 2022 at 04:16:01PM +0000, Edgecombe, Rick P wrote:
> > On Fri, 2022-06-10 at 17:35 +0300, Kirill A. Shutemov wrote:
> > > +static int prctl_enable_tagged_addr(unsigned long nr_bits)
> > > +{
> > > +       struct mm_struct *mm = current->mm;
> > > +
> > > +       /* Already enabled? */
> > > +       if (mm->context.lam_cr3_mask)
> > > +               return -EBUSY;
> > > +
> > > +       /* LAM has to be enabled before spawning threads */
> > > +       if (get_nr_threads(current) > 1)
> > > +               return -EBUSY;
> > 
> > Does this work for vfork()? I guess the idea is that locking is not
> > needed below because there is only one thread with the MM, but with
> > vfork() another task could operate on the MM, call fork(), etc. I'm
> > not
> > sure...
> 
> I'm not sure I follow. vfork() blocks parent process until child exit
> or
> execve(). I don't see how it is a problem.

Oh yea, you're right.
Edgecombe, Rick P June 10, 2022, 10:18 p.m. UTC | #6
On Fri, 2022-06-10 at 11:08 -0700, Edgecombe, Richard P wrote:
> On Fri, 2022-06-10 at 21:06 +0300, Kirill A. Shutemov wrote:
> > On Fri, Jun 10, 2022 at 04:16:01PM +0000, Edgecombe, Rick P wrote:
> > > On Fri, 2022-06-10 at 17:35 +0300, Kirill A. Shutemov wrote:
> > > > +static int prctl_enable_tagged_addr(unsigned long nr_bits)
> > > > +{
> > > > +       struct mm_struct *mm = current->mm;
> > > > +
> > > > +       /* Already enabled? */
> > > > +       if (mm->context.lam_cr3_mask)
> > > > +               return -EBUSY;
> > > > +
> > > > +       /* LAM has to be enabled before spawning threads */
> > > > +       if (get_nr_threads(current) > 1)
> > > > +               return -EBUSY;
> > > 
> > > Does this work for vfork()? I guess the idea is that locking is
> > > not
> > > needed below because there is only one thread with the MM, but
> > > with
> > > vfork() another task could operate on the MM, call fork(), etc.
> > > I'm
> > > not
> > > sure...
> > 
> > I'm not sure I follow. vfork() blocks parent process until child
> > exit
> > or
> > execve(). I don't see how it is a problem.
> 
> Oh yea, you're right.

Actually, I guess vfork() only suspends the calling thread. So what if
you had:
1. Parent spawns a bunch of threads
2. vforks()
3. Child enables LAM (it only has one thread, so succeeds)
4. Child exits()
5. Parent has some threads with LAM, and some not

It's some weird userspace that doesn't deserve to have things work for
it, but I wonder if it could open up little races around untagging. As
an example, KVM might have a super narrow race where it checks for tags
in memslots using addr != untagged_addr(addr) before checking
access_ok(addr, ...). See __kvm_set_memory_region(). If
mm->context.untag_mask got set in the middle, tagged memslots could be
added.
kirill.shutemov@linux.intel.com June 11, 2022, 1:12 a.m. UTC | #7
On Fri, Jun 10, 2022 at 10:18:23PM +0000, Edgecombe, Rick P wrote:
> On Fri, 2022-06-10 at 11:08 -0700, Edgecombe, Richard P wrote:
> > On Fri, 2022-06-10 at 21:06 +0300, Kirill A. Shutemov wrote:
> > > On Fri, Jun 10, 2022 at 04:16:01PM +0000, Edgecombe, Rick P wrote:
> > > > On Fri, 2022-06-10 at 17:35 +0300, Kirill A. Shutemov wrote:
> > > > > +static int prctl_enable_tagged_addr(unsigned long nr_bits)
> > > > > +{
> > > > > +       struct mm_struct *mm = current->mm;
> > > > > +
> > > > > +       /* Already enabled? */
> > > > > +       if (mm->context.lam_cr3_mask)
> > > > > +               return -EBUSY;
> > > > > +
> > > > > +       /* LAM has to be enabled before spawning threads */
> > > > > +       if (get_nr_threads(current) > 1)
> > > > > +               return -EBUSY;
> > > > 
> > > > Does this work for vfork()? I guess the idea is that locking is
> > > > not
> > > > needed below because there is only one thread with the MM, but
> > > > with
> > > > vfork() another task could operate on the MM, call fork(), etc.
> > > > I'm
> > > > not
> > > > sure...
> > > 
> > > I'm not sure I follow. vfork() blocks parent process until child
> > > exit
> > > or
> > > execve(). I don't see how it is a problem.
> > 
> > Oh yea, you're right.
> 
> Actually, I guess vfork() only suspends the calling thread. So what if
> you had:
> 1. Parent spawns a bunch of threads
> 2. vforks()
> 3. Child enables LAM (it only has one thread, so succeeds)
> 4. Child exits()
> 5. Parent has some threads with LAM, and some not

I think it is in "Don't do that" territory. It is very similar to cases
described in "Caveats" section of the vfork(2) man-page.

> It's some weird userspace that doesn't deserve to have things work for
> it, but I wonder if it could open up little races around untagging. As
> an example, KVM might have a super narrow race where it checks for tags
> in memslots using addr != untagged_addr(addr) before checking
> access_ok(addr, ...). See __kvm_set_memory_region(). If mm-
> >context.untag_mask got set in the middle, tagged memslots could be
> added.

Ultimately, a process which calls vfork(2) is in control of what happens
to the new process until execve(2) or exit(2). So, yes, it is a very
creative way to shoot yourself in the foot, but I don't think it is worth
preventing.

And I'm not sure what the fix would look like.
Edgecombe, Rick P June 11, 2022, 2:36 a.m. UTC | #8
On Sat, 2022-06-11 at 04:12 +0300, Kirill A. Shutemov wrote:
> On Fri, Jun 10, 2022 at 10:18:23PM +0000, Edgecombe, Rick P wrote:
> > On Fri, 2022-06-10 at 11:08 -0700, Edgecombe, Richard P wrote:
> > > On Fri, 2022-06-10 at 21:06 +0300, Kirill A. Shutemov wrote:
> > > > On Fri, Jun 10, 2022 at 04:16:01PM +0000, Edgecombe, Rick P
> > > > wrote:
> > > > > On Fri, 2022-06-10 at 17:35 +0300, Kirill A. Shutemov wrote:
> > > > > > +static int prctl_enable_tagged_addr(unsigned long nr_bits)
> > > > > > +{
> > > > > > +       struct mm_struct *mm = current->mm;
> > > > > > +
> > > > > > +       /* Already enabled? */
> > > > > > +       if (mm->context.lam_cr3_mask)
> > > > > > +               return -EBUSY;
> > > > > > +
> > > > > > +       /* LAM has to be enabled before spawning threads */
> > > > > > +       if (get_nr_threads(current) > 1)
> > > > > > +               return -EBUSY;
> > > > > 
> > > > > Does this work for vfork()? I guess the idea is that locking
> > > > > is
> > > > > not
> > > > > needed below because there is only one thread with the MM,
> > > > > but
> > > > > with
> > > > > vfork() another task could operate on the MM, call fork(),
> > > > > etc.
> > > > > I'm
> > > > > not
> > > > > sure...
> > > > 
> > > > I'm not sure I follow. vfork() blocks parent process until
> > > > child
> > > > exit
> > > > or
> > > > execve(). I don't see how it is a problem.
> > > 
> > > Oh yea, you're right.
> > 
> > Actually, I guess vfork() only suspends the calling thread. So what
> > if
> > you had:
> > 1. Parent spawns a bunch of threads
> > 2. vforks()
> > 3. Child enables LAM (it only has one thread, so succeeds)
> > 4. Child exits()
> > 5. Parent has some threads with LAM, and some not
> 
> I think it is in "Don't do that" territory. It is very similar to
> cases
> described in "Caveats" section of the vfork(2) man-page.
> 
> > It's some weird userspace that doesn't deserve to have things work
> > for
> > it, but I wonder if it could open up little races around untagging.
> > As
> > an example, KVM might have a super narrow race where it checks for
> > tags
> > in memslots using addr != untagged_addr(addr) before checking
> > access_ok(addr, ...). See __kvm_set_memory_region(). If mm-
> > > context.untag_mask got set in the middle, tagged memslots could
> > > be
> > 
> > added.
> 
> Ultimately, a process which calls vfork(2) is in control of what
> happens
> to the new process until execve(2) or exit(2). So, yes it is very
> creative
> way to shoot yourself into leg, but I don't think it worth
> preventing.
> 
> And I'm not sure how the fix would look like.

Yea, userspace shooting itself in the foot is fine. You would really
have to go out of your way to do that. But my concern is that it will
expose the kernel. The KVM scenario I outlined is a narrow race, but it
lets guests write to freed pages. So the "not first thread enabling"
seems like a generally fragile thing.

I don't know how to fix it, but I think enabling LAM seems fraught and
should be contained strictly to MMs with one thread.

I'm not sure, but what about using in_vfork()?
Andy Lutomirski June 12, 2022, 9:03 p.m. UTC | #9
On Fri, Jun 10, 2022, at 3:18 PM, Edgecombe, Rick P wrote:
> On Fri, 2022-06-10 at 11:08 -0700, Edgecombe, Richard P wrote:
>> On Fri, 2022-06-10 at 21:06 +0300, Kirill A. Shutemov wrote:
>> > On Fri, Jun 10, 2022 at 04:16:01PM +0000, Edgecombe, Rick P wrote:
>> > > On Fri, 2022-06-10 at 17:35 +0300, Kirill A. Shutemov wrote:
>> > > > +static int prctl_enable_tagged_addr(unsigned long nr_bits)
>> > > > +{
>> > > > +       struct mm_struct *mm = current->mm;
>> > > > +
>> > > > +       /* Already enabled? */
>> > > > +       if (mm->context.lam_cr3_mask)
>> > > > +               return -EBUSY;
>> > > > +
>> > > > +       /* LAM has to be enabled before spawning threads */
>> > > > +       if (get_nr_threads(current) > 1)
>> > > > +               return -EBUSY;
>> > > 
>> > > Does this work for vfork()? I guess the idea is that locking is
>> > > not
>> > > needed below because there is only one thread with the MM, but
>> > > with
>> > > vfork() another task could operate on the MM, call fork(), etc.
>> > > I'm
>> > > not
>> > > sure...
>> > 
>> > I'm not sure I follow. vfork() blocks parent process until child
>> > exit
>> > or
>> > execve(). I don't see how it is a problem.
>> 
>> Oh yea, you're right.
>
> Actually, I guess vfork() only suspends the calling thread. So what if
> you had:
> 1. Parent spawns a bunch of threads
> 2. vforks()
> 3. Child enables LAM (it only has one thread, so succeeds)
> 4. Child exits()
> 5. Parent has some threads with LAM, and some not
>
> It's some weird userspace that doesn't deserve to have things work for
> it, but I wonder if it could open up little races around untagging. As
> an example, KVM might have a super narrow race where it checks for tags
> in memslots using addr != untagged_addr(addr) before checking
> access_ok(addr, ...). See __kvm_set_memory_region(). If mm-
>>context.untag_mask got set in the middle, tagged memslots could be
> added.

get_nr_threads() is the wrong thing.  Either look at mm->mm_users or find a way to get rid of this restriction entirely.

IMO it would not be insane to have a way to iterate over all tasks using an mm.  But doing this for io_uring, etc might be interesting.
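
For illustration, the mm_users-based check being suggested here would look
something like the following inside prctl_enable_tagged_addr() (a sketch
only, not taken from any posted version of the patch):

	/*
	 * Refuse if anything else holds a user reference to the mm:
	 * other threads, CLONE_VM children, etc. mm_users is an
	 * atomic_t in struct mm_struct.
	 */
	if (atomic_read(&mm->mm_users) > 1)
		return -EBUSY;
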
Michal Hocko June 13, 2022, 2:42 p.m. UTC | #10
On Fri 10-06-22 17:35:25, Kirill A. Shutemov wrote:
[...]
> diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
> index 1962008fe743..93c8eba1a66d 100644
> --- a/arch/x86/kernel/process_64.c
> +++ b/arch/x86/kernel/process_64.c
> @@ -742,6 +742,32 @@ static long prctl_map_vdso(const struct vdso_image *image, unsigned long addr)
>  }
>  #endif
>  
> +static int prctl_enable_tagged_addr(unsigned long nr_bits)
> +{
> +	struct mm_struct *mm = current->mm;
> +
> +	/* Already enabled? */
> +	if (mm->context.lam_cr3_mask)
> +		return -EBUSY;
> +
> +	/* LAM has to be enabled before spawning threads */
> +	if (get_nr_threads(current) > 1)
> +		return -EBUSY;

This will not be sufficient in general. You can have an mm shared with a
process without CLONE_THREAD. So you would also need to check
MMF_MULTIPROCESS. But I do remember that get_nr_threads() in general is
quite tricky to use properly. Make sure to CC Oleg Nesterov for more
details.

Also how does this work when the mm is shared with a kernel thread?

> +
> +	if (!nr_bits) {
> +		return -EINVAL;
> +	} else if (nr_bits <= 6) {
> +		mm->context.lam_cr3_mask = X86_CR3_LAM_U57;
> +		mm->context.untag_mask =  ~GENMASK(62, 57);
> +	} else {
> +		return -EINVAL;
> +	}
> +
> +	/* Update CR3 to get LAM active */
> +	switch_mm(current->mm, current->mm, current);
> +	return 0;
> +}
Peter Zijlstra June 16, 2022, 9:39 a.m. UTC | #11
On Fri, Jun 10, 2022 at 05:35:25PM +0300, Kirill A. Shutemov wrote:
> Add a couple of arch_prctl() handles:
> 
>  - ARCH_ENABLE_TAGGED_ADDR enabled LAM. The argument is required number
>    of tag bits. It is rounded up to the nearest LAM mode that can
>    provide it. For now only LAM_U57 is supported, with 6 tag bits.

As stated in the reply to the other patch; this isn't a 'for now' thing.
Due to how the untag thing works. Supporting U48 on 5-level kernels is
going to require a more complicated untag.
Peter Zijlstra June 16, 2022, 9:44 a.m. UTC | #12
On Sun, Jun 12, 2022 at 02:03:43PM -0700, Andy Lutomirski wrote:

> >> > > > +       /* LAM has to be enabled before spawning threads */
> >> > > > +       if (get_nr_threads(current) > 1)
> >> > > > +               return -EBUSY;

> >> > > Does this work for vfork()? I guess the idea is that locking is

vfork() isn't the problem, the problem is that Linux allows CLONE_VM
without CLONE_THREAD. Now, mostly nobody does that these days, but it is
possible.

> get_nr_threads() is the wrong thing.  Either look at mm->mm_users or
> find a way to get rid of this restriction entirely.

mm->mm_users should indeed be sufficient here.

> IMO it would not be insane to have a way to iterate over all tasks
> using an mm.  But doing this for io_uring, etc might be interesting.

That has come up so often over the past 15+ years I've no idea how come
we've still not managed to actually do that ;-)
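
To make the CLONE_VM-without-CLONE_THREAD case concrete, a userspace
sketch (hedged: error handling omitted; it is only meant to show that a
separate process can share the caller's mm without raising its thread
count):

#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/wait.h>

static int child_fn(void *arg)
{
	/*
	 * Runs as a separate process (no CLONE_THREAD), yet shares the
	 * parent's mm, so get_nr_threads() in the parent stays at 1.
	 */
	return 0;
}

int main(void)
{
	size_t stack_size = 64 * 1024;
	char *stack = malloc(stack_size);

	/* CLONE_VM without CLONE_THREAD: shared mm, separate task group. */
	clone(child_fn, stack + stack_size, CLONE_VM | SIGCHLD, NULL);
	wait(NULL);
	free(stack);
	return 0;
}
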
kirill.shutemov@linux.intel.com June 16, 2022, 4:54 p.m. UTC | #13
On Thu, Jun 16, 2022 at 11:44:59AM +0200, Peter Zijlstra wrote:
> > get_nr_threads() is the wrong thing.  Either look at mm->mm_users or
> > find a way to get rid of this restriction entirely.
> 
> mm->mm_users should indeed be sufficient here.

Hm. kthread_use_mm() doesn't bump mm_users. Do we care?
kirill.shutemov@linux.intel.com June 16, 2022, 5:05 p.m. UTC | #14
On Mon, Jun 13, 2022 at 04:42:57PM +0200, Michal Hocko wrote:
> On Fri 10-06-22 17:35:25, Kirill A. Shutemov wrote:
> [...]
> > diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
> > index 1962008fe743..93c8eba1a66d 100644
> > --- a/arch/x86/kernel/process_64.c
> > +++ b/arch/x86/kernel/process_64.c
> > @@ -742,6 +742,32 @@ static long prctl_map_vdso(const struct vdso_image *image, unsigned long addr)
> >  }
> >  #endif
> >  
> > +static int prctl_enable_tagged_addr(unsigned long nr_bits)
> > +{
> > +	struct mm_struct *mm = current->mm;
> > +
> > +	/* Already enabled? */
> > +	if (mm->context.lam_cr3_mask)
> > +		return -EBUSY;
> > +
> > +	/* LAM has to be enabled before spawning threads */
> > +	if (get_nr_threads(current) > 1)
> > +		return -EBUSY;
> 
> This will not be sufficient in general. You can have mm shared with a
> process without CLONE_THREAD. So you would also need to check also
> MMF_MULTIPROCESS. But I do remember that general get_nr_threads is quite
> tricky to use properly. Make sure to CC Oleg Nesterov for more details.
> 
> Also how does this work when the mm is shared with a kernel thread?

It seems we need to check mm_count to exclude kernel threads that use the
mm. But I expect it to produce a bunch of false positives.

Or we can make all CPUs do

	switch_mm(current->mm, current->mm, current);

and get the LAM bits updated regardless of which mm each CPU is running. It
would also remove the limitation that LAM can only be enabled when there are
no other threads.

But I feel that is a bad idea, though I have no clue why. :P
Kirill A . Shutemov June 19, 2022, 11:40 p.m. UTC | #15
On Thu, Jun 16, 2022 at 08:05:10PM +0300, Kirill A. Shutemov wrote:
> On Mon, Jun 13, 2022 at 04:42:57PM +0200, Michal Hocko wrote:
> > On Fri 10-06-22 17:35:25, Kirill A. Shutemov wrote:
> > [...]
> > > diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
> > > index 1962008fe743..93c8eba1a66d 100644
> > > --- a/arch/x86/kernel/process_64.c
> > > +++ b/arch/x86/kernel/process_64.c
> > > @@ -742,6 +742,32 @@ static long prctl_map_vdso(const struct vdso_image *image, unsigned long addr)
> > >  }
> > >  #endif
> > >  
> > > +static int prctl_enable_tagged_addr(unsigned long nr_bits)
> > > +{
> > > +	struct mm_struct *mm = current->mm;
> > > +
> > > +	/* Already enabled? */
> > > +	if (mm->context.lam_cr3_mask)
> > > +		return -EBUSY;
> > > +
> > > +	/* LAM has to be enabled before spawning threads */
> > > +	if (get_nr_threads(current) > 1)
> > > +		return -EBUSY;
> > 
> > This will not be sufficient in general. You can have mm shared with a
> > process without CLONE_THREAD. So you would also need to check also
> > MMF_MULTIPROCESS. But I do remember that general get_nr_threads is quite
> > tricky to use properly. Make sure to CC Oleg Nesterov for more details.
> > 
> > Also how does this work when the mm is shared with a kernel thread?
> 
> It seems we need to check mm_count to exclude kernel threads that use the
> mm. But I expect it to produce bunch of false-positives.
> 
> Or we can make all CPUs to do
> 
> 	switch_mm(current->mm, current->mm, current);
> 
> and get LAM bits updated regardless what mm it runs. It would also remove
> limitation that LAM can only be enabled when there's no threads.
> 
> But I feel that is a bad idea, but I have no clue why. :P

Below is what I meant. Maybe it's not that bad. I dunno.

Any opinions?

diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index 56822d313b96..69e6b11efa62 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -752,6 +752,16 @@ static bool lam_u48_allowed(void)
 	return find_vma(mm, DEFAULT_MAP_WINDOW) == NULL;
 }
 
+static void enable_lam_func(void *mm)
+{
+	struct mm_struct *loaded_mm = this_cpu_read(cpu_tlbstate.loaded_mm);
+
+	if (loaded_mm != mm)
+		return;
+
+	switch_mm(loaded_mm, loaded_mm, current);
+}
+
 static int prctl_enable_tagged_addr(unsigned long nr_bits)
 {
 	struct mm_struct *mm = current->mm;
@@ -760,10 +770,6 @@ static int prctl_enable_tagged_addr(unsigned long nr_bits)
 	if (mm->context.lam_cr3_mask)
 		return -EBUSY;
 
-	/* LAM has to be enabled before spawning threads */
-	if (get_nr_threads(current) > 1)
-		return -EBUSY;
-
 	if (!nr_bits) {
 		return -EINVAL;
 	} else if (nr_bits <= 6) {
@@ -785,8 +791,8 @@ static int prctl_enable_tagged_addr(unsigned long nr_bits)
 		return -EINVAL;
 	}
 
-	/* Update CR3 to get LAM active */
-	switch_mm(current->mm, current->mm, current);
+	on_each_cpu_mask(mm_cpumask(mm), enable_lam_func, mm, true);
+
 	return 0;
 }
Andy Lutomirski June 28, 2022, 11:42 p.m. UTC | #16
On 6/10/22 07:35, Kirill A. Shutemov wrote:

> +	/* Update CR3 to get LAM active */
> +	switch_mm(current->mm, current->mm, current);

Can you at least justify this oddity?  When changing an LDT, we use a 
dedicated mechanism.  Is there a significant benefit to abusing 
switch_mm for this?

Also, why can't we enable LAM on a multithreaded process?  We can change 
an LDT, and the code isn't even particularly complicated.
kirill.shutemov@linux.intel.com June 29, 2022, 12:53 a.m. UTC | #17
On Tue, Jun 28, 2022 at 04:42:40PM -0700, Andy Lutomirski wrote:
> On 6/10/22 07:35, Kirill A. Shutemov wrote:
> 
> > +	/* Update CR3 to get LAM active */
> > +	switch_mm(current->mm, current->mm, current);
> 
> Can you at least justify this oddity?  When changing an LDT, we use a
> dedicated mechanism.  Is there a significant benefit to abusing switch_mm
> for this?

I'm not sure I follow. LAM mode is set in CR3. switch_mm() has to handle
it anyway to context switch. Why do you consider it abuse?

> 
> Also, why can't we enable LAM on a multithreaded process?  We can change an
> LDT, and the code isn't even particularly complicated.

I reworked this in v4[1] and it allows multithreaded processes. Have you
got that version?

Intel had issue with mail server, but I assumed it didn't affect my
patchset since I see it in the archive.

[1] https://lore.kernel.org/all/20220622162230.83474-1-kirill.shutemov@linux.intel.com/
Andy Lutomirski June 30, 2022, 2:04 a.m. UTC | #18
On Thu, Jun 16, 2022, at 9:54 AM, Kirill A. Shutemov wrote:
> On Thu, Jun 16, 2022 at 11:44:59AM +0200, Peter Zijlstra wrote:
>> > get_nr_threads() is the wrong thing.  Either look at mm->mm_users or
>> > find a way to get rid of this restriction entirely.
>> 
>> mm->mm_users should indeed be sufficient here.
>
> Hm. kthread_use_mm() doesn't bump mm_users. Do we care?

I think the idea is that the kthread in question is expected to hold an mm_users reference. 

>
> -- 
>  Kirill A. Shutemov
Andy Lutomirski June 30, 2022, 2:29 a.m. UTC | #19
On Tue, Jun 28, 2022, at 5:53 PM, Kirill A. Shutemov wrote:
> On Tue, Jun 28, 2022 at 04:42:40PM -0700, Andy Lutomirski wrote:
>> On 6/10/22 07:35, Kirill A. Shutemov wrote:
>> 
>> > +	/* Update CR3 to get LAM active */
>> > +	switch_mm(current->mm, current->mm, current);
>> 
>> Can you at least justify this oddity?  When changing an LDT, we use a
>> dedicated mechanism.  Is there a significant benefit to abusing switch_mm
>> for this?
>
> I'm not sure I follow. LAM mode is set in CR3. switch_mm() has to handle
> it anyway to context switch. Why do you consider it abuse?
>
>> 
>> Also, why can't we enable LAM on a multithreaded process?  We can change an
>> LDT, and the code isn't even particularly complicated.
>
> I reworked this in v4[1] and it allows multithreaded processes. Have you
> got that version?
>
> Intel had issue with mail server, but I assumed it didn't affect my
> patchset since I see it in the archive.
>

I didn’t notice it. Not quite sure what the issue was. Could just be incompetence on my part.

I think that’s the right idea, except that I think you shouldn’t use switch_mm for this. Just update the LAM bits directly.   Once you read mm_cpumask, you should be guaranteed (see next paragraph) that, for each CPU that isn’t in the set, if it switches to the new mm, it will notice the new LAM.

I say “should be” because I think smp_wmb() is insufficient. You’re ordering a write with a subsequent read, which needs smp_mb().

> [1] 
> https://lore.kernel.org/all/20220622162230.83474-1-kirill.shutemov@linux.intel.com/
>
> -- 
>  Kirill A. Shutemov
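
The ordering argument, restated as a store-buffering style sketch (hedged:
this is illustrative C11, not kernel code; "mask" stands in for
mm->context.lam_cr3_mask and "cpu_bit" for this CPU's bit in mm_cpumask()):

#include <stdatomic.h>

atomic_long mask;      /* lam_cr3_mask, written by the enabling CPU      */
atomic_int  cpu_bit;   /* mm_cpumask bit, set by a CPU switching in      */

void enabling_cpu(void)
{
	atomic_store_explicit(&mask, 1, memory_order_relaxed);
	/* smp_mb(): order the store above against the load below.       */
	atomic_thread_fence(memory_order_seq_cst);
	/* Sample the cpumask; CPUs not seen here get no IPI and must    */
	/* therefore observe the new mask when they switch in.           */
	(void)atomic_load_explicit(&cpu_bit, memory_order_relaxed);
}

void switching_cpu(void)
{
	atomic_store_explicit(&cpu_bit, 1, memory_order_relaxed);
	/* switch_mm_irqs_off() provides a full barrier on this side.    */
	atomic_thread_fence(memory_order_seq_cst);
	/* With full barriers on both sides, at least one CPU observes   */
	/* the other's store; smp_wmb() alone would allow both to miss.  */
	(void)atomic_load_explicit(&mask, memory_order_relaxed);
}
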
kirill.shutemov@linux.intel.com July 1, 2022, 3:38 p.m. UTC | #20
On Wed, Jun 29, 2022 at 07:29:13PM -0700, Andy Lutomirski wrote:
> 
> 
> On Tue, Jun 28, 2022, at 5:53 PM, Kirill A. Shutemov wrote:
> > On Tue, Jun 28, 2022 at 04:42:40PM -0700, Andy Lutomirski wrote:
> >> On 6/10/22 07:35, Kirill A. Shutemov wrote:
> >> 
> >> > +	/* Update CR3 to get LAM active */
> >> > +	switch_mm(current->mm, current->mm, current);
> >> 
> >> Can you at least justify this oddity?  When changing an LDT, we use a
> >> dedicated mechanism.  Is there a significant benefit to abusing switch_mm
> >> for this?
> >
> > I'm not sure I follow. LAM mode is set in CR3. switch_mm() has to handle
> > it anyway to context switch. Why do you consider it abuse?
> >
> >> 
> >> Also, why can't we enable LAM on a multithreaded process?  We can change an
> >> LDT, and the code isn't even particularly complicated.
> >
> > I reworked this in v4[1] and it allows multithreaded processes. Have you
> > got that version?
> >
> > Intel had issue with mail server, but I assumed it didn't affect my
> > patchset since I see it in the archive.
> >
> 
> I didn’t notice it. Not quite sure what the issue was. Could just be
> incompetence on my part.
> 
> I think that’s the right idea, except that I think you shouldn’t use
> switch_mm for this. Just update the LAM bits directly.   Once you read
> mm_cpumask, you should be guaranteed (see next paragraph) that, for each
> CPU that isn’t in the set, if it switches to the new mm, it will notice
> the new LAM.
> 
> I say “should be” because I think smp_wmb() is insufficient. You’re
> ordering a write with a subsequent read, which needs smp_mb().

I think it is better to put smp_mb() to make it explicit.

Does the fixup below look okay?

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 2d70d75e207f..8da54e7b6f98 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -367,4 +367,30 @@ static inline void __native_tlb_flush_global(unsigned long cr4)
 	native_write_cr4(cr4 ^ X86_CR4_PGE);
 	native_write_cr4(cr4);
 }
+
+#ifdef CONFIG_X86_64
+static inline u64 tlbstate_lam_cr3_mask(void)
+{
+	u64 lam = this_cpu_read(cpu_tlbstate.lam);
+
+	return lam << X86_CR3_LAM_U57_BIT;
+}
+
+static inline void set_tlbstate_lam_cr3_mask(u64 mask)
+{
+	this_cpu_write(cpu_tlbstate.lam, mask >> X86_CR3_LAM_U57_BIT);
+}
+
+#else
+
+static inline u64 tlbstate_lam_cr3_mask(void)
+{
+	return 0;
+}
+
+static inline void set_tlbstate_lam_cr3_mask(u64 mask)
+{
+}
+#endif
+
 #endif /* _ASM_X86_TLBFLUSH_H */
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index 427ebef3f64b..cd2b03fe94c4 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -745,15 +745,16 @@ static long prctl_map_vdso(const struct vdso_image *image, unsigned long addr)
 static void enable_lam_func(void *mm)
 {
 	struct mm_struct *loaded_mm = this_cpu_read(cpu_tlbstate.loaded_mm);
+	unsigned long lam_mask;
 
 	if (loaded_mm != mm)
 		return;
 
-	/* Counterpart of smp_wmb() in prctl_enable_tagged_addr() */
-	smp_rmb();
+	lam_mask = READ_ONCE(loaded_mm->context.lam_cr3_mask);
 
 	/* Update CR3 to get LAM active on the CPU */
-	switch_mm(loaded_mm, loaded_mm, current);
+	write_cr3(__read_cr3() | lam_mask);
+	set_tlbstate_lam_cr3_mask(lam_mask);
 }
 
 static bool lam_u48_allowed(void)
@@ -805,7 +806,7 @@ static int prctl_enable_tagged_addr(struct mm_struct *mm, unsigned long nr_bits)
 	}
 
 	/* Make lam_cr3_mask and untag_mask visible on other CPUs */
-	smp_wmb();
+	smp_mb();
 
 	on_each_cpu_mask(mm_cpumask(mm), enable_lam_func, mm, true);
 out:
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index c5c4f76329c2..d9a2acdae90f 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -486,31 +486,6 @@ void cr4_update_pce(void *ignored)
 static inline void cr4_update_pce_mm(struct mm_struct *mm) { }
 #endif
 
-#ifdef CONFIG_X86_64
-static inline u64 tlbstate_lam_cr3_mask(void)
-{
-	u64 lam = this_cpu_read(cpu_tlbstate.lam);
-
-	return lam << X86_CR3_LAM_U57_BIT;
-}
-
-static inline void set_tlbstate_lam_cr3_mask(u64 mask)
-{
-	this_cpu_write(cpu_tlbstate.lam, mask >> X86_CR3_LAM_U57_BIT);
-}
-
-#else
-
-static inline u64 tlbstate_lam_cr3_mask(void)
-{
-	return 0;
-}
-
-static inline void set_tlbstate_lam_cr3_mask(u64 mask)
-{
-}
-#endif
-
 void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 			struct task_struct *tsk)
 {
@@ -581,7 +556,7 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 	 * provides that full memory barrier and core serializing
 	 * instruction.
 	 */
-	if (real_prev == next && prev_lam == new_lam) {
+	if (real_prev == next) {
 		VM_WARN_ON(this_cpu_read(cpu_tlbstate.ctxs[prev_asid].ctx_id) !=
 			   next->context.ctx_id);
Andy Lutomirski July 2, 2022, 11:55 p.m. UTC | #21
On Fri, Jul 1, 2022, at 8:38 AM, Kirill A. Shutemov wrote:
> On Wed, Jun 29, 2022 at 07:29:13PM -0700, Andy Lutomirski wrote:
>> 
>> 
>> On Tue, Jun 28, 2022, at 5:53 PM, Kirill A. Shutemov wrote:
>> > On Tue, Jun 28, 2022 at 04:42:40PM -0700, Andy Lutomirski wrote:
>> >> On 6/10/22 07:35, Kirill A. Shutemov wrote:
>> >> 
>> >> > +	/* Update CR3 to get LAM active */
>> >> > +	switch_mm(current->mm, current->mm, current);
>> >> 
>> >> Can you at least justify this oddity?  When changing an LDT, we use a
>> >> dedicated mechanism.  Is there a significant benefit to abusing switch_mm
>> >> for this?
>> >
>> > I'm not sure I follow. LAM mode is set in CR3. switch_mm() has to handle
>> > it anyway to context switch. Why do you consider it abuse?
>> >
>> >> 
>> >> Also, why can't we enable LAM on a multithreaded process?  We can change an
>> >> LDT, and the code isn't even particularly complicated.
>> >
>> > I reworked this in v4[1] and it allows multithreaded processes. Have you
>> > got that version?
>> >
>> > Intel had issue with mail server, but I assumed it didn't affect my
>> > patchset since I see it in the archive.
>> >
>> 
>> I didn’t notice it. Not quite sure what the issue was. Could just be
>> incompetence on my part.
>> 
>> I think that’s the right idea, except that I think you shouldn’t use
>> switch_mm for this. Just update the LAM bits directly.   Once you read
>> mm_cpumask, you should be guaranteed (see next paragraph) that, for each
>> CPU that isn’t in the set, if it switches to the new mm, it will notice
>> the new LAM.
>> 
>> I say “should be” because I think smp_wmb() is insufficient. You’re
>> ordering a write with a subsequent read, which needs smp_mb().
>
> I think it is better to put smp_mb() to make it explicit.
>
> Does the fixup below look okay?
>
> diff --git a/arch/x86/include/asm/tlbflush.h 
> b/arch/x86/include/asm/tlbflush.h
> index 2d70d75e207f..8da54e7b6f98 100644
> --- a/arch/x86/include/asm/tlbflush.h
> +++ b/arch/x86/include/asm/tlbflush.h
> @@ -367,4 +367,30 @@ static inline void 
> __native_tlb_flush_global(unsigned long cr4)
>  	native_write_cr4(cr4 ^ X86_CR4_PGE);
>  	native_write_cr4(cr4);
>  }
> +
> +#ifdef CONFIG_X86_64
> +static inline u64 tlbstate_lam_cr3_mask(void)
> +{
> +	u64 lam = this_cpu_read(cpu_tlbstate.lam);
> +
> +	return lam << X86_CR3_LAM_U57_BIT;
> +}
> +
> +static inline void set_tlbstate_lam_cr3_mask(u64 mask)
> +{
> +	this_cpu_write(cpu_tlbstate.lam, mask >> X86_CR3_LAM_U57_BIT);
> +}
> +
> +#else
> +
> +static inline u64 tlbstate_lam_cr3_mask(void)
> +{
> +	return 0;
> +}
> +
> +static inline void set_tlbstate_lam_cr3_mask(u64 mask)
> +{
> +}
> +#endif
> +
>  #endif /* _ASM_X86_TLBFLUSH_H */
> diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
> index 427ebef3f64b..cd2b03fe94c4 100644
> --- a/arch/x86/kernel/process_64.c
> +++ b/arch/x86/kernel/process_64.c
> @@ -745,15 +745,16 @@ static long prctl_map_vdso(const struct 
> vdso_image *image, unsigned long addr)
>  static void enable_lam_func(void *mm)
>  {
>  	struct mm_struct *loaded_mm = this_cpu_read(cpu_tlbstate.loaded_mm);
> +	unsigned long lam_mask;
> 
>  	if (loaded_mm != mm)
>  		return;
> 
> -	/* Counterpart of smp_wmb() in prctl_enable_tagged_addr() */
> -	smp_rmb();
> +	lam_mask = READ_ONCE(loaded_mm->context.lam_cr3_mask);
> 
>  	/* Update CR3 to get LAM active on the CPU */
> -	switch_mm(loaded_mm, loaded_mm, current);
> +	write_cr3(__read_cr3() | lam_mask);

Perhaps this should also mask off the old LAM mask?

> +	set_tlbstate_lam_cr3_mask(lam_mask);
>  }
> 
>  static bool lam_u48_allowed(void)
> @@ -805,7 +806,7 @@ static int prctl_enable_tagged_addr(struct 
> mm_struct *mm, unsigned long nr_bits)
>  	}
> 
>  	/* Make lam_cr3_mask and untag_mask visible on other CPUs */
> -	smp_wmb();
> +	smp_mb();
> 
>  	on_each_cpu_mask(mm_cpumask(mm), enable_lam_func, mm, true);
>  out:
> diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
> index c5c4f76329c2..d9a2acdae90f 100644
> --- a/arch/x86/mm/tlb.c
> +++ b/arch/x86/mm/tlb.c
> @@ -486,31 +486,6 @@ void cr4_update_pce(void *ignored)
>  static inline void cr4_update_pce_mm(struct mm_struct *mm) { }
>  #endif
> 
> -#ifdef CONFIG_X86_64
> -static inline u64 tlbstate_lam_cr3_mask(void)
> -{
> -	u64 lam = this_cpu_read(cpu_tlbstate.lam);
> -
> -	return lam << X86_CR3_LAM_U57_BIT;
> -}
> -
> -static inline void set_tlbstate_lam_cr3_mask(u64 mask)
> -{
> -	this_cpu_write(cpu_tlbstate.lam, mask >> X86_CR3_LAM_U57_BIT);
> -}
> -
> -#else
> -
> -static inline u64 tlbstate_lam_cr3_mask(void)
> -{
> -	return 0;
> -}
> -
> -static inline void set_tlbstate_lam_cr3_mask(u64 mask)
> -{
> -}
> -#endif
> -
>  void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
>  			struct task_struct *tsk)
>  {
> @@ -581,7 +556,7 @@ void switch_mm_irqs_off(struct mm_struct *prev, 
> struct mm_struct *next,
>  	 * provides that full memory barrier and core serializing
>  	 * instruction.
>  	 */
> -	if (real_prev == next && prev_lam == new_lam) {
> +	if (real_prev == next) {
>  		VM_WARN_ON(this_cpu_read(cpu_tlbstate.ctxs[prev_asid].ctx_id) !=
>  			   next->context.ctx_id);
> 
> -- 
>  Kirill A. Shutemov
kirill.shutemov@linux.intel.com July 4, 2022, 1:43 p.m. UTC | #22
On Sat, Jul 02, 2022 at 04:55:40PM -0700, Andy Lutomirski wrote:
> > diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
> > index 427ebef3f64b..cd2b03fe94c4 100644
> > --- a/arch/x86/kernel/process_64.c
> > +++ b/arch/x86/kernel/process_64.c
> > @@ -745,15 +745,16 @@ static long prctl_map_vdso(const struct 
> > vdso_image *image, unsigned long addr)
> >  static void enable_lam_func(void *mm)
> >  {
> >  	struct mm_struct *loaded_mm = this_cpu_read(cpu_tlbstate.loaded_mm);
> > +	unsigned long lam_mask;
> > 
> >  	if (loaded_mm != mm)
> >  		return;
> > 
> > -	/* Counterpart of smp_wmb() in prctl_enable_tagged_addr() */
> > -	smp_rmb();
> > +	lam_mask = READ_ONCE(loaded_mm->context.lam_cr3_mask);
> > 
> >  	/* Update CR3 to get LAM active on the CPU */
> > -	switch_mm(loaded_mm, loaded_mm, current);
> > +	write_cr3(__read_cr3() | lam_mask);
> 
> Perhaps this should also mask off the old LAM mask?

So far LAM enabling is a one-way operation, so it should be fine.
But I think masking it off is a good idea to avoid problems in the future.
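
A sketch of the masking-off variant for enable_lam_func() (hedged: it
assumes both X86_CR3_LAM_U48 and X86_CR3_LAM_U57 CR3 bits are defined as
introduced earlier in this series; the actual fix may use a dedicated mask
macro instead):

	unsigned long cr3 = __read_cr3();

	/* Drop any stale LAM bits before applying the new mask. */
	cr3 &= ~(X86_CR3_LAM_U48 | X86_CR3_LAM_U57);
	write_cr3(cr3 | lam_mask);
	set_tlbstate_lam_cr3_mask(lam_mask);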

Patch

diff --git a/arch/x86/include/uapi/asm/prctl.h b/arch/x86/include/uapi/asm/prctl.h
index 500b96e71f18..38164a05c23c 100644
--- a/arch/x86/include/uapi/asm/prctl.h
+++ b/arch/x86/include/uapi/asm/prctl.h
@@ -20,4 +20,7 @@ 
 #define ARCH_MAP_VDSO_32		0x2002
 #define ARCH_MAP_VDSO_64		0x2003
 
+#define ARCH_GET_UNTAG_MASK		0x4001
+#define ARCH_ENABLE_TAGGED_ADDR		0x4002
+
 #endif /* _ASM_X86_PRCTL_H */
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index 1962008fe743..93c8eba1a66d 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -742,6 +742,32 @@  static long prctl_map_vdso(const struct vdso_image *image, unsigned long addr)
 }
 #endif
 
+static int prctl_enable_tagged_addr(unsigned long nr_bits)
+{
+	struct mm_struct *mm = current->mm;
+
+	/* Already enabled? */
+	if (mm->context.lam_cr3_mask)
+		return -EBUSY;
+
+	/* LAM has to be enabled before spawning threads */
+	if (get_nr_threads(current) > 1)
+		return -EBUSY;
+
+	if (!nr_bits) {
+		return -EINVAL;
+	} else if (nr_bits <= 6) {
+		mm->context.lam_cr3_mask = X86_CR3_LAM_U57;
+		mm->context.untag_mask =  ~GENMASK(62, 57);
+	} else {
+		return -EINVAL;
+	}
+
+	/* Update CR3 to get LAM active */
+	switch_mm(current->mm, current->mm, current);
+	return 0;
+}
+
 long do_arch_prctl_64(struct task_struct *task, int option, unsigned long arg2)
 {
 	int ret = 0;
@@ -829,7 +855,11 @@  long do_arch_prctl_64(struct task_struct *task, int option, unsigned long arg2)
 	case ARCH_MAP_VDSO_64:
 		return prctl_map_vdso(&vdso_image_64, arg2);
 #endif
-
+	case ARCH_GET_UNTAG_MASK:
+		return put_user(current->mm->context.untag_mask,
+				(unsigned long __user *)arg2);
+	case ARCH_ENABLE_TAGGED_ADDR:
+		return prctl_enable_tagged_addr(arg2);
 	default:
 		ret = -EINVAL;
 		break;