diff mbox series

[v5,04/13] mm/shmem: Restrict MFD_INACCESSIBLE memory against RLIMIT_MEMLOCK

Message ID 20220310140911.50924-5-chao.p.peng@linux.intel.com (mailing list archive)
State New, archived
Headers show
Series KVM: mm: fd-based approach for supporting KVM guest private memory | expand

Commit Message

Chao Peng March 10, 2022, 2:09 p.m. UTC
Since page migration / swapping is not supported yet, MFD_INACCESSIBLE
memory behave like longterm pinned pages and thus should be accounted to
mm->pinned_vm and be restricted by RLIMIT_MEMLOCK.

Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
---
 mm/shmem.c | 25 ++++++++++++++++++++++++-
 1 file changed, 24 insertions(+), 1 deletion(-)

Comments

Sean Christopherson April 7, 2022, 4:05 p.m. UTC | #1
On Thu, Mar 10, 2022, Chao Peng wrote:
> Since page migration / swapping is not supported yet, MFD_INACCESSIBLE
> memory behave like longterm pinned pages and thus should be accounted to
> mm->pinned_vm and be restricted by RLIMIT_MEMLOCK.
> 
> Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> ---
>  mm/shmem.c | 25 ++++++++++++++++++++++++-
>  1 file changed, 24 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 7b43e274c9a2..ae46fb96494b 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -915,14 +915,17 @@ static void notify_fallocate(struct inode *inode, pgoff_t start, pgoff_t end)
>  static void notify_invalidate_page(struct inode *inode, struct folio *folio,
>  				   pgoff_t start, pgoff_t end)
>  {
> -#ifdef CONFIG_MEMFILE_NOTIFIER
>  	struct shmem_inode_info *info = SHMEM_I(inode);
>  
> +#ifdef CONFIG_MEMFILE_NOTIFIER
>  	start = max(start, folio->index);
>  	end = min(end, folio->index + folio_nr_pages(folio));
>  
>  	memfile_notifier_invalidate(&info->memfile_notifiers, start, end);
>  #endif
> +
> +	if (info->xflags & SHM_F_INACCESSIBLE)
> +		atomic64_sub(end - start, &current->mm->pinned_vm);

As Vishal's to-be-posted selftest discovered, this is broken as current->mm may
be NULL.  Or it may be a completely different mm, e.g. AFAICT there's nothing that
prevents a different process from punching hole in the shmem backing.

I don't see a sane way of tracking this in the backing store unless the inode is
associated with a single mm when it's created, and that opens up a giant can of
worms, e.g. what happens with the accounting if the creating process goes away?

I think the correct approach is to not do the locking automatically for SHM_F_INACCESSIBLE,
and instead require userspace to do shmctl(.., SHM_LOCK, ...) if userspace knows the
consumers don't support migrate/swap.  That'd require wrapping migrate_page() and then
wiring up notifier hooks for migrate/swap, but IMO that's a good thing to get sorted
out sooner than later.  KVM isn't planning on support migrate/swap for TDX or SNP,
but supporting at least migrate for a software-only implementation a la pKVM should
be relatively straightforward.  On the notifiee side, KVM can terminate the VM if it
gets an unexpected migrate/swap, e.g. so that TDX/SEV VMs don't die later with
exceptions and/or data corruption (pre-SNP SEV guests) in the guest.

Hmm, shmem_writepage() already handles SHM_F_INACCESSIBLE by rejecting the swap, so
maybe it's just the page migration path that needs to be updated?
Andy Lutomirski April 7, 2022, 5:09 p.m. UTC | #2
On Thu, Apr 7, 2022, at 9:05 AM, Sean Christopherson wrote:
> On Thu, Mar 10, 2022, Chao Peng wrote:
>> Since page migration / swapping is not supported yet, MFD_INACCESSIBLE
>> memory behave like longterm pinned pages and thus should be accounted to
>> mm->pinned_vm and be restricted by RLIMIT_MEMLOCK.
>> 
>> Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
>> ---
>>  mm/shmem.c | 25 ++++++++++++++++++++++++-
>>  1 file changed, 24 insertions(+), 1 deletion(-)
>> 
>> diff --git a/mm/shmem.c b/mm/shmem.c
>> index 7b43e274c9a2..ae46fb96494b 100644
>> --- a/mm/shmem.c
>> +++ b/mm/shmem.c
>> @@ -915,14 +915,17 @@ static void notify_fallocate(struct inode *inode, pgoff_t start, pgoff_t end)
>>  static void notify_invalidate_page(struct inode *inode, struct folio *folio,
>>  				   pgoff_t start, pgoff_t end)
>>  {
>> -#ifdef CONFIG_MEMFILE_NOTIFIER
>>  	struct shmem_inode_info *info = SHMEM_I(inode);
>>  
>> +#ifdef CONFIG_MEMFILE_NOTIFIER
>>  	start = max(start, folio->index);
>>  	end = min(end, folio->index + folio_nr_pages(folio));
>>  
>>  	memfile_notifier_invalidate(&info->memfile_notifiers, start, end);
>>  #endif
>> +
>> +	if (info->xflags & SHM_F_INACCESSIBLE)
>> +		atomic64_sub(end - start, &current->mm->pinned_vm);
>
> As Vishal's to-be-posted selftest discovered, this is broken as 
> current->mm may
> be NULL.  Or it may be a completely different mm, e.g. AFAICT there's 
> nothing that
> prevents a different process from punching hole in the shmem backing.
>

How about just not charging the mm in the first place?  There’s precedent: ramfs and hugetlbfs (at least sometimes — I’ve lost track of the current status).

In any case, for an administrator to try to assemble the various rlimits into a coherent policy is, and always has been, quite messy. ISTM cgroup limits, which can actually add across processes usefully, are much better.

So, aside from the fact that these fds aren’t in a filesystem and are thus available by default, I’m not convinced that this accounting is useful or necessary.

Maybe we could just have some switch require to enable creation of private memory in the first place, and anyone who flips that switch without configuring cgroups is subject to DoS.
Chao Peng April 8, 2022, 1:02 p.m. UTC | #3
On Thu, Apr 07, 2022 at 04:05:36PM +0000, Sean Christopherson wrote:
> On Thu, Mar 10, 2022, Chao Peng wrote:
> > Since page migration / swapping is not supported yet, MFD_INACCESSIBLE
> > memory behave like longterm pinned pages and thus should be accounted to
> > mm->pinned_vm and be restricted by RLIMIT_MEMLOCK.
> > 
> > Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> > ---
> >  mm/shmem.c | 25 ++++++++++++++++++++++++-
> >  1 file changed, 24 insertions(+), 1 deletion(-)
> > 
> > diff --git a/mm/shmem.c b/mm/shmem.c
> > index 7b43e274c9a2..ae46fb96494b 100644
> > --- a/mm/shmem.c
> > +++ b/mm/shmem.c
> > @@ -915,14 +915,17 @@ static void notify_fallocate(struct inode *inode, pgoff_t start, pgoff_t end)
> >  static void notify_invalidate_page(struct inode *inode, struct folio *folio,
> >  				   pgoff_t start, pgoff_t end)
> >  {
> > -#ifdef CONFIG_MEMFILE_NOTIFIER
> >  	struct shmem_inode_info *info = SHMEM_I(inode);
> >  
> > +#ifdef CONFIG_MEMFILE_NOTIFIER
> >  	start = max(start, folio->index);
> >  	end = min(end, folio->index + folio_nr_pages(folio));
> >  
> >  	memfile_notifier_invalidate(&info->memfile_notifiers, start, end);
> >  #endif
> > +
> > +	if (info->xflags & SHM_F_INACCESSIBLE)
> > +		atomic64_sub(end - start, &current->mm->pinned_vm);
> 
> As Vishal's to-be-posted selftest discovered, this is broken as current->mm may
> be NULL.  Or it may be a completely different mm, e.g. AFAICT there's nothing that
> prevents a different process from punching hole in the shmem backing.
> 
> I don't see a sane way of tracking this in the backing store unless the inode is
> associated with a single mm when it's created, and that opens up a giant can of
> worms, e.g. what happens with the accounting if the creating process goes away?

Yes, I realized this.

> 
> I think the correct approach is to not do the locking automatically for SHM_F_INACCESSIBLE,
> and instead require userspace to do shmctl(.., SHM_LOCK, ...) if userspace knows the
> consumers don't support migrate/swap.  That'd require wrapping migrate_page() and then
> wiring up notifier hooks for migrate/swap, but IMO that's a good thing to get sorted
> out sooner than later.  KVM isn't planning on support migrate/swap for TDX or SNP,
> but supporting at least migrate for a software-only implementation a la pKVM should
> be relatively straightforward.  On the notifiee side, KVM can terminate the VM if it
> gets an unexpected migrate/swap, e.g. so that TDX/SEV VMs don't die later with
> exceptions and/or data corruption (pre-SNP SEV guests) in the guest.

SHM_LOCK sounds like a good match.

Thanks,
Chao
> 
> Hmm, shmem_writepage() already handles SHM_F_INACCESSIBLE by rejecting the swap, so
> maybe it's just the page migration path that needs to be updated?
Sean Christopherson April 8, 2022, 5:56 p.m. UTC | #4
On Thu, Apr 07, 2022, Andy Lutomirski wrote:
> 
> On Thu, Apr 7, 2022, at 9:05 AM, Sean Christopherson wrote:
> > On Thu, Mar 10, 2022, Chao Peng wrote:
> >> Since page migration / swapping is not supported yet, MFD_INACCESSIBLE
> >> memory behave like longterm pinned pages and thus should be accounted to
> >> mm->pinned_vm and be restricted by RLIMIT_MEMLOCK.
> >> 
> >> Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> >> ---
> >>  mm/shmem.c | 25 ++++++++++++++++++++++++-
> >>  1 file changed, 24 insertions(+), 1 deletion(-)
> >> 
> >> diff --git a/mm/shmem.c b/mm/shmem.c
> >> index 7b43e274c9a2..ae46fb96494b 100644
> >> --- a/mm/shmem.c
> >> +++ b/mm/shmem.c
> >> @@ -915,14 +915,17 @@ static void notify_fallocate(struct inode *inode, pgoff_t start, pgoff_t end)
> >>  static void notify_invalidate_page(struct inode *inode, struct folio *folio,
> >>  				   pgoff_t start, pgoff_t end)
> >>  {
> >> -#ifdef CONFIG_MEMFILE_NOTIFIER
> >>  	struct shmem_inode_info *info = SHMEM_I(inode);
> >>  
> >> +#ifdef CONFIG_MEMFILE_NOTIFIER
> >>  	start = max(start, folio->index);
> >>  	end = min(end, folio->index + folio_nr_pages(folio));
> >>  
> >>  	memfile_notifier_invalidate(&info->memfile_notifiers, start, end);
> >>  #endif
> >> +
> >> +	if (info->xflags & SHM_F_INACCESSIBLE)
> >> +		atomic64_sub(end - start, &current->mm->pinned_vm);
> >
> > As Vishal's to-be-posted selftest discovered, this is broken as current->mm
> > may be NULL.  Or it may be a completely different mm, e.g. AFAICT there's
> > nothing that prevents a different process from punching hole in the shmem
> > backing.
> >
> 
> How about just not charging the mm in the first place?  There’s precedent:
> ramfs and hugetlbfs (at least sometimes — I’ve lost track of the current
> status).
> 
> In any case, for an administrator to try to assemble the various rlimits into
> a coherent policy is, and always has been, quite messy. ISTM cgroup limits,
> which can actually add across processes usefully, are much better.
> 
> So, aside from the fact that these fds aren’t in a filesystem and are thus
> available by default, I’m not convinced that this accounting is useful or
> necessary.
> 
> Maybe we could just have some switch require to enable creation of private
> memory in the first place, and anyone who flips that switch without
> configuring cgroups is subject to DoS.

I personally have no objection to that, and I'm 99% certain Google doesn't rely
on RLIMIT_MEMLOCK.
David Hildenbrand April 8, 2022, 6:54 p.m. UTC | #5
On 08.04.22 19:56, Sean Christopherson wrote:
> On Thu, Apr 07, 2022, Andy Lutomirski wrote:
>>
>> On Thu, Apr 7, 2022, at 9:05 AM, Sean Christopherson wrote:
>>> On Thu, Mar 10, 2022, Chao Peng wrote:
>>>> Since page migration / swapping is not supported yet, MFD_INACCESSIBLE
>>>> memory behave like longterm pinned pages and thus should be accounted to
>>>> mm->pinned_vm and be restricted by RLIMIT_MEMLOCK.
>>>>
>>>> Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
>>>> ---
>>>>  mm/shmem.c | 25 ++++++++++++++++++++++++-
>>>>  1 file changed, 24 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/mm/shmem.c b/mm/shmem.c
>>>> index 7b43e274c9a2..ae46fb96494b 100644
>>>> --- a/mm/shmem.c
>>>> +++ b/mm/shmem.c
>>>> @@ -915,14 +915,17 @@ static void notify_fallocate(struct inode *inode, pgoff_t start, pgoff_t end)
>>>>  static void notify_invalidate_page(struct inode *inode, struct folio *folio,
>>>>  				   pgoff_t start, pgoff_t end)
>>>>  {
>>>> -#ifdef CONFIG_MEMFILE_NOTIFIER
>>>>  	struct shmem_inode_info *info = SHMEM_I(inode);
>>>>  
>>>> +#ifdef CONFIG_MEMFILE_NOTIFIER
>>>>  	start = max(start, folio->index);
>>>>  	end = min(end, folio->index + folio_nr_pages(folio));
>>>>  
>>>>  	memfile_notifier_invalidate(&info->memfile_notifiers, start, end);
>>>>  #endif
>>>> +
>>>> +	if (info->xflags & SHM_F_INACCESSIBLE)
>>>> +		atomic64_sub(end - start, &current->mm->pinned_vm);
>>>
>>> As Vishal's to-be-posted selftest discovered, this is broken as current->mm
>>> may be NULL.  Or it may be a completely different mm, e.g. AFAICT there's
>>> nothing that prevents a different process from punching hole in the shmem
>>> backing.
>>>
>>
>> How about just not charging the mm in the first place?  There’s precedent:
>> ramfs and hugetlbfs (at least sometimes — I’ve lost track of the current
>> status).
>>
>> In any case, for an administrator to try to assemble the various rlimits into
>> a coherent policy is, and always has been, quite messy. ISTM cgroup limits,
>> which can actually add across processes usefully, are much better.
>>
>> So, aside from the fact that these fds aren’t in a filesystem and are thus
>> available by default, I’m not convinced that this accounting is useful or
>> necessary.
>>
>> Maybe we could just have some switch require to enable creation of private
>> memory in the first place, and anyone who flips that switch without
>> configuring cgroups is subject to DoS.
> 
> I personally have no objection to that, and I'm 99% certain Google doesn't rely
> on RLIMIT_MEMLOCK.
> 

It's unnacceptable for distributions to have random unprivileged users
be able to allocate an unlimited amount of unmovable memory. And any
kind of these "switches" won't help a thing because the distribution
will have to enable them either way.

I raised in the past that accounting might be challenging, so it's no
surprise that something popped up now.

RLIMIT_MEMLOCK was the obvious candidate, but as we discovered int he
past already with secretmem, it's not 100% that good of a fit (unmovable
is worth than mlocked). But it gets the job done for now at least.

So I'm open for alternative to limit the amount of unmovable memory we
might allocate for user space, and then we could convert seretmem as well.

Random switches are not an option.
Kirill A. Shutemov April 11, 2022, 3:32 p.m. UTC | #6
On Thu, Apr 07, 2022 at 04:05:36PM +0000, Sean Christopherson wrote:
> Hmm, shmem_writepage() already handles SHM_F_INACCESSIBLE by rejecting the swap, so
> maybe it's just the page migration path that needs to be updated?

My early version prevented migration with -ENOTSUPP for
address_space_operations::migratepage().

What's wrong with that approach?
Kirill A. Shutemov April 11, 2022, 3:34 p.m. UTC | #7
On Fri, Apr 08, 2022 at 09:02:54PM +0800, Chao Peng wrote:
> > I think the correct approach is to not do the locking automatically for SHM_F_INACCESSIBLE,
> > and instead require userspace to do shmctl(.., SHM_LOCK, ...) if userspace knows the
> > consumers don't support migrate/swap.  That'd require wrapping migrate_page() and then
> > wiring up notifier hooks for migrate/swap, but IMO that's a good thing to get sorted
> > out sooner than later.  KVM isn't planning on support migrate/swap for TDX or SNP,
> > but supporting at least migrate for a software-only implementation a la pKVM should
> > be relatively straightforward.  On the notifiee side, KVM can terminate the VM if it
> > gets an unexpected migrate/swap, e.g. so that TDX/SEV VMs don't die later with
> > exceptions and/or data corruption (pre-SNP SEV guests) in the guest.
> 
> SHM_LOCK sounds like a good match.

Emm, no. shmctl(2) and SHM_LOCK are SysV IPC thing. I don't see how they
fit here.
Hugh Dickins April 12, 2022, 5:14 a.m. UTC | #8
On Mon, 11 Apr 2022, Kirill A. Shutemov wrote:
> On Fri, Apr 08, 2022 at 09:02:54PM +0800, Chao Peng wrote:
> > > I think the correct approach is to not do the locking automatically for SHM_F_INACCESSIBLE,
> > > and instead require userspace to do shmctl(.., SHM_LOCK, ...) if userspace knows the
> > > consumers don't support migrate/swap.  That'd require wrapping migrate_page() and then
> > > wiring up notifier hooks for migrate/swap, but IMO that's a good thing to get sorted
> > > out sooner than later.  KVM isn't planning on support migrate/swap for TDX or SNP,
> > > but supporting at least migrate for a software-only implementation a la pKVM should
> > > be relatively straightforward.  On the notifiee side, KVM can terminate the VM if it
> > > gets an unexpected migrate/swap, e.g. so that TDX/SEV VMs don't die later with
> > > exceptions and/or data corruption (pre-SNP SEV guests) in the guest.
> > 
> > SHM_LOCK sounds like a good match.
> 
> Emm, no. shmctl(2) and SHM_LOCK are SysV IPC thing. I don't see how they
> fit here.

I am still struggling to formulate a constructive response on
MFD_INACCESSIBLE in general: but before doing so, let me jump in here
to say that I'm firmly on the side of SHM_LOCK being the right model -
but admittedly not through userspace calling shmctl(2).

Please refer to our last year's posting "[PATCH 10/16] tmpfs: fcntl(fd,
F_MEM_LOCK) to memlock a tmpfs file" for the example of how Shakeel did
it then (though only a small part of that would be needed for this case):
https://lore.kernel.org/linux-mm/54e03798-d836-ae64-f41-4a1d46bc115b@google.com/

And until such time as swapping is enabled, this memlock accounting would
be necessarily entailed by "MFD_INACCESSIBLE", or however that turns out
to be implemented: not something that we could trust userspace to call
separately.

Hugh
Chao Peng April 12, 2022, 1:39 p.m. UTC | #9
On Mon, Apr 11, 2022 at 06:32:33PM +0300, Kirill A. Shutemov wrote:
> On Thu, Apr 07, 2022 at 04:05:36PM +0000, Sean Christopherson wrote:
> > Hmm, shmem_writepage() already handles SHM_F_INACCESSIBLE by rejecting the swap, so
> > maybe it's just the page migration path that needs to be updated?
> 
> My early version prevented migration with -ENOTSUPP for
> address_space_operations::migratepage().
> 
> What's wrong with that approach?

I previously thought migratepage will not be called since we already
marked the pages as UNMOVABLE, sounds not correct?

Thanks,
Chao
> 
> -- 
>  Kirill A. Shutemov
Jason Gunthorpe April 12, 2022, 2:36 p.m. UTC | #10
On Fri, Apr 08, 2022 at 08:54:02PM +0200, David Hildenbrand wrote:

> RLIMIT_MEMLOCK was the obvious candidate, but as we discovered int he
> past already with secretmem, it's not 100% that good of a fit (unmovable
> is worth than mlocked). But it gets the job done for now at least.

No, it doesn't. There are too many different interpretations how
MELOCK is supposed to work

eg VFIO accounts per-process so hostile users can just fork to go past
it.

RDMA is per-process but uses a different counter, so you can double up

iouring is per-user and users a 3rd counter, so it can triple up on
the above two

> So I'm open for alternative to limit the amount of unmovable memory we
> might allocate for user space, and then we could convert seretmem as well.

I think it has to be cgroup based considering where we are now :\

Jason
Kirill A. Shutemov April 12, 2022, 7:28 p.m. UTC | #11
On Tue, Apr 12, 2022 at 09:39:25PM +0800, Chao Peng wrote:
> On Mon, Apr 11, 2022 at 06:32:33PM +0300, Kirill A. Shutemov wrote:
> > On Thu, Apr 07, 2022 at 04:05:36PM +0000, Sean Christopherson wrote:
> > > Hmm, shmem_writepage() already handles SHM_F_INACCESSIBLE by rejecting the swap, so
> > > maybe it's just the page migration path that needs to be updated?
> > 
> > My early version prevented migration with -ENOTSUPP for
> > address_space_operations::migratepage().
> > 
> > What's wrong with that approach?
> 
> I previously thought migratepage will not be called since we already
> marked the pages as UNMOVABLE, sounds not correct?

Do you mean missing __GFP_MOVABLE? I can be wrong, but I don't see that it
direclty affects if the page is migratable. It is a hint to page allocator
to group unmovable pages to separate page block and impove availablity of
higher order pages this way. Page allocator tries to allocate unmovable
pages from pages blocks that already have unmovable pages.
Andy Lutomirski April 12, 2022, 9:27 p.m. UTC | #12
On Tue, Apr 12, 2022, at 7:36 AM, Jason Gunthorpe wrote:
> On Fri, Apr 08, 2022 at 08:54:02PM +0200, David Hildenbrand wrote:
>
>> RLIMIT_MEMLOCK was the obvious candidate, but as we discovered int he
>> past already with secretmem, it's not 100% that good of a fit (unmovable
>> is worth than mlocked). But it gets the job done for now at least.
>
> No, it doesn't. There are too many different interpretations how
> MELOCK is supposed to work
>
> eg VFIO accounts per-process so hostile users can just fork to go past
> it.
>
> RDMA is per-process but uses a different counter, so you can double up
>
> iouring is per-user and users a 3rd counter, so it can triple up on
> the above two
>
>> So I'm open for alternative to limit the amount of unmovable memory we
>> might allocate for user space, and then we could convert seretmem as well.
>
> I think it has to be cgroup based considering where we are now :\
>

So this is another situation where the actual backend (TDX, SEV, pKVM, pure software) makes a difference -- depending on exactly what backend we're using, the memory may not be unmoveable.  It might even be swappable (in the potentially distant future).

Anyway, here's a concrete proposal, with a bit of handwaving:

We add new cgroup limits:

memory.unmoveable
memory.locked

These can be set to an actual number or they can be set to the special value ROOT_CAP.  If they're set to ROOT_CAP, then anyone in the cgroup with capable(CAP_SYS_RESOURCE) (i.e. the global capability) can allocate movable or locked memory with this (and potentially other) new APIs.  If it's 0, then they can't.  If it's another value, then the memory can be allocated, charged to the cgroup, up to the limit, with no particular capability needed.  The default at boot is ROOT_CAP.  Anyone who wants to configure it differently is free to do so.  This avoids introducing a DoS, makes it easy to run tests without configuring cgroup, and lets serious users set up their cgroups.

Nothing is charge per mm.

To make this fully sensible, we need to know what the backend is for the private memory before allocating any so that we can charge it accordingly.
Chao Peng April 13, 2022, 9:15 a.m. UTC | #13
On Tue, Apr 12, 2022 at 10:28:21PM +0300, Kirill A. Shutemov wrote:
> On Tue, Apr 12, 2022 at 09:39:25PM +0800, Chao Peng wrote:
> > On Mon, Apr 11, 2022 at 06:32:33PM +0300, Kirill A. Shutemov wrote:
> > > On Thu, Apr 07, 2022 at 04:05:36PM +0000, Sean Christopherson wrote:
> > > > Hmm, shmem_writepage() already handles SHM_F_INACCESSIBLE by rejecting the swap, so
> > > > maybe it's just the page migration path that needs to be updated?
> > > 
> > > My early version prevented migration with -ENOTSUPP for
> > > address_space_operations::migratepage().
> > > 
> > > What's wrong with that approach?
> > 
> > I previously thought migratepage will not be called since we already
> > marked the pages as UNMOVABLE, sounds not correct?
> 
> Do you mean missing __GFP_MOVABLE?

Yes.

> I can be wrong, but I don't see that it
> direclty affects if the page is migratable. It is a hint to page allocator
> to group unmovable pages to separate page block and impove availablity of
> higher order pages this way. Page allocator tries to allocate unmovable
> pages from pages blocks that already have unmovable pages.

OK, thanks.

Chao
> 
> -- 
>  Kirill A. Shutemov
David Hildenbrand April 13, 2022, 4:24 p.m. UTC | #14
On 12.04.22 16:36, Jason Gunthorpe wrote:
> On Fri, Apr 08, 2022 at 08:54:02PM +0200, David Hildenbrand wrote:
> 
>> RLIMIT_MEMLOCK was the obvious candidate, but as we discovered int he
>> past already with secretmem, it's not 100% that good of a fit (unmovable
>> is worth than mlocked). But it gets the job done for now at least.
> 
> No, it doesn't. There are too many different interpretations how
> MELOCK is supposed to work
> 
> eg VFIO accounts per-process so hostile users can just fork to go past
> it.
> 
> RDMA is per-process but uses a different counter, so you can double up
> 
> iouring is per-user and users a 3rd counter, so it can triple up on
> the above two

Thanks for that summary, very helpful.

> 
>> So I'm open for alternative to limit the amount of unmovable memory we
>> might allocate for user space, and then we could convert seretmem as well.
> 
> I think it has to be cgroup based considering where we are now :\

Most probably. I think the important lessons we learned are that

* mlocked != unmovable.
* RLIMIT_MEMLOCK should most probably never have been abused for
  unmovable memory (especially, long-term pinning)
David Hildenbrand April 13, 2022, 4:30 p.m. UTC | #15
> 
> So this is another situation where the actual backend (TDX, SEV, pKVM, pure software) makes a difference -- depending on exactly what backend we're using, the memory may not be unmoveable.  It might even be swappable (in the potentially distant future).

Right. And on a system without swap we don't particularly care about
mlock, but we might (in most cases) care about fragmentation with
unmovable memory.

> 
> Anyway, here's a concrete proposal, with a bit of handwaving:

Thanks for investing some brainpower.

> 
> We add new cgroup limits:
> 
> memory.unmoveable
> memory.locked
> 
> These can be set to an actual number or they can be set to the special value ROOT_CAP.  If they're set to ROOT_CAP, then anyone in the cgroup with capable(CAP_SYS_RESOURCE) (i.e. the global capability) can allocate movable or locked memory with this (and potentially other) new APIs.  If it's 0, then they can't.  If it's another value, then the memory can be allocated, charged to the cgroup, up to the limit, with no particular capability needed.  The default at boot is ROOT_CAP.  Anyone who wants to configure it differently is free to do so.  This avoids introducing a DoS, makes it easy to run tests without configuring cgroup, and lets serious users set up their cgroups.

I wonder what the implications are for existing user space.

Assume we want to move page pinning (rdma, vfio, io_uring, ...) to the
new model. How can we be sure

a) We don't break existing user space
b) We don't open the doors unnoticed for the admin to go crazy on
   unmovable memory.

Any ideas?

> 
> Nothing is charge per mm.
> 
> To make this fully sensible, we need to know what the backend is for the private memory before allocating any so that we can charge it accordingly.

Right, the support for migration and/or swap defines how to account.
Jason Gunthorpe April 13, 2022, 5:52 p.m. UTC | #16
On Wed, Apr 13, 2022 at 06:24:56PM +0200, David Hildenbrand wrote:
> On 12.04.22 16:36, Jason Gunthorpe wrote:
> > On Fri, Apr 08, 2022 at 08:54:02PM +0200, David Hildenbrand wrote:
> > 
> >> RLIMIT_MEMLOCK was the obvious candidate, but as we discovered int he
> >> past already with secretmem, it's not 100% that good of a fit (unmovable
> >> is worth than mlocked). But it gets the job done for now at least.
> > 
> > No, it doesn't. There are too many different interpretations how
> > MELOCK is supposed to work
> > 
> > eg VFIO accounts per-process so hostile users can just fork to go past
> > it.
> > 
> > RDMA is per-process but uses a different counter, so you can double up
> > 
> > iouring is per-user and users a 3rd counter, so it can triple up on
> > the above two
> 
> Thanks for that summary, very helpful.

I kicked off a big discussion when I suggested to change vfio to use
the same as io_uring

We may still end up trying it, but the major concern is that libvirt
sets the RLIMIT_MEMLOCK and if we touch anything here - including
fixing RDMA, or anything really, it becomes a uAPI break for libvirt..

> >> So I'm open for alternative to limit the amount of unmovable memory we
> >> might allocate for user space, and then we could convert seretmem as well.
> > 
> > I think it has to be cgroup based considering where we are now :\
> 
> Most probably. I think the important lessons we learned are that
> 
> * mlocked != unmovable.
> * RLIMIT_MEMLOCK should most probably never have been abused for
>   unmovable memory (especially, long-term pinning)

The trouble is I'm not sure how anything can correctly/meaningfully
set a limit.

Consider qemu where we might have 3 different things all pinning the
same page (rdma, iouring, vfio) - should the cgroup give 3x the limit?
What use is that really?

IMHO there are only two meaningful scenarios - either you are unpriv
and limited to a very small number for your user/cgroup - or you are
priv and you can do whatever you want.

The idea we can fine tune this to exactly the right amount for a
workload does not seem realistic and ends up exporting internal kernel
decisions into a uAPI..

Jason
David Hildenbrand April 25, 2022, 2:07 p.m. UTC | #17
On 13.04.22 19:52, Jason Gunthorpe wrote:
> On Wed, Apr 13, 2022 at 06:24:56PM +0200, David Hildenbrand wrote:
>> On 12.04.22 16:36, Jason Gunthorpe wrote:
>>> On Fri, Apr 08, 2022 at 08:54:02PM +0200, David Hildenbrand wrote:
>>>
>>>> RLIMIT_MEMLOCK was the obvious candidate, but as we discovered int he
>>>> past already with secretmem, it's not 100% that good of a fit (unmovable
>>>> is worth than mlocked). But it gets the job done for now at least.
>>>
>>> No, it doesn't. There are too many different interpretations how
>>> MELOCK is supposed to work
>>>
>>> eg VFIO accounts per-process so hostile users can just fork to go past
>>> it.
>>>
>>> RDMA is per-process but uses a different counter, so you can double up
>>>
>>> iouring is per-user and users a 3rd counter, so it can triple up on
>>> the above two
>>
>> Thanks for that summary, very helpful.
> 
> I kicked off a big discussion when I suggested to change vfio to use
> the same as io_uring
> 
> We may still end up trying it, but the major concern is that libvirt
> sets the RLIMIT_MEMLOCK and if we touch anything here - including
> fixing RDMA, or anything really, it becomes a uAPI break for libvirt..
> 

Okay, so we have to introduce a second mechanism, don't use
RLIMIT_MEMLOCK for new unmovable memory, and then eventually phase out
RLIMIT_MEMLOCK usage for existing unmovable memory consumers (which, as
you say, will be difficult).

>>>> So I'm open for alternative to limit the amount of unmovable memory we
>>>> might allocate for user space, and then we could convert seretmem as well.
>>>
>>> I think it has to be cgroup based considering where we are now :\
>>
>> Most probably. I think the important lessons we learned are that
>>
>> * mlocked != unmovable.
>> * RLIMIT_MEMLOCK should most probably never have been abused for
>>   unmovable memory (especially, long-term pinning)
> 
> The trouble is I'm not sure how anything can correctly/meaningfully
> set a limit.
> 
> Consider qemu where we might have 3 different things all pinning the
> same page (rdma, iouring, vfio) - should the cgroup give 3x the limit?
> What use is that really?

I think your tackling a related problem, that we double-account
unmovable/mlocked memory due to lack of ways to track that a page is
already pinned by the same user/cgroup/whatsoever. Not easy to solve.

The problem also becomes interesting if iouring with fixed buffers
doesn't work on guest RAM, but on some other QEMU buffers.

> 
> IMHO there are only two meaningful scenarios - either you are unpriv
> and limited to a very small number for your user/cgroup - or you are
> priv and you can do whatever you want.
> 
> The idea we can fine tune this to exactly the right amount for a
> workload does not seem realistic and ends up exporting internal kernel
> decisions into a uAPI..


IMHO, there are three use cases:

* App that conditionally uses selected mechanism that end up requiring
  unmovable, long-term allocations. Secretmem, iouring, rdma. We want
  some sane, small default. Apps have a backup path in case any such
  mechanism fails because we're out of allowed unmovable resources.
* App that relies on selected mechanism that end up requiring unmovable,
  long-term allocations. E.g., vfio with known memory consumption, such
  as the VM size. It's fairly easy to come up with the right value.
* App that relies on multiple mechanism that end up requiring unmovable,
  long-term allocations. QEMU with rdma, iouring, vfio, ... I agree that
  coming up with something good is problematic.

Then, there are privileged/unprivileged apps. There might be admins that
just don't care. There might be admins that even want to set some limit
instead of configuring "unlimited" for QEMU.

Long story short, it should be an admin choice what to configure,
especially:
* What the default is for random apps
* What the maximum is for selected apps
* Which apps don't have a maximum
diff mbox series

Patch

diff --git a/mm/shmem.c b/mm/shmem.c
index 7b43e274c9a2..ae46fb96494b 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -915,14 +915,17 @@  static void notify_fallocate(struct inode *inode, pgoff_t start, pgoff_t end)
 static void notify_invalidate_page(struct inode *inode, struct folio *folio,
 				   pgoff_t start, pgoff_t end)
 {
-#ifdef CONFIG_MEMFILE_NOTIFIER
 	struct shmem_inode_info *info = SHMEM_I(inode);
 
+#ifdef CONFIG_MEMFILE_NOTIFIER
 	start = max(start, folio->index);
 	end = min(end, folio->index + folio_nr_pages(folio));
 
 	memfile_notifier_invalidate(&info->memfile_notifiers, start, end);
 #endif
+
+	if (info->xflags & SHM_F_INACCESSIBLE)
+		atomic64_sub(end - start, &current->mm->pinned_vm);
 }
 
 /*
@@ -2680,6 +2683,20 @@  static loff_t shmem_file_llseek(struct file *file, loff_t offset, int whence)
 	return offset;
 }
 
+static bool memlock_limited(unsigned long npages)
+{
+	unsigned long lock_limit;
+	unsigned long pinned;
+
+	lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
+	pinned = atomic64_add_return(npages, &current->mm->pinned_vm);
+	if (pinned > lock_limit && !capable(CAP_IPC_LOCK)) {
+		atomic64_sub(npages, &current->mm->pinned_vm);
+		return true;
+	}
+	return false;
+}
+
 static long shmem_fallocate(struct file *file, int mode, loff_t offset,
 							 loff_t len)
 {
@@ -2753,6 +2770,12 @@  static long shmem_fallocate(struct file *file, int mode, loff_t offset,
 		goto out;
 	}
 
+	if ((info->xflags & SHM_F_INACCESSIBLE) &&
+			memlock_limited(end - start)) {
+		error = -ENOMEM;
+		goto out;
+	}
+
 	shmem_falloc.waitq = NULL;
 	shmem_falloc.start = start;
 	shmem_falloc.next  = start;