[0/5] use pinned_vm instead of locked_vm to account pinned pages

Message ID: 20190211224437.25267-1-daniel.m.jordan@oracle.com

Message

Daniel Jordan Feb. 11, 2019, 10:44 p.m. UTC
Hi,

This series converts users that account pinned pages with locked_vm to
account with pinned_vm instead, pinned_vm being the correct counter to
use.  It's based on a similar patch I posted recently[0].

The patches are based on rdma/for-next to build on Davidlohr Bueso's
recent conversion of pinned_vm to an atomic64_t[1].  Seems to make some
sense for these to be routed the same way, despite lack of rdma content?

All five of these places, and probably some of Davidlohr's conversions,
probably want to be collapsed into a common helper in the core mm for
accounting pinned pages.  I tried, and there are several details that
likely need discussion, so this can be done as a follow-on.

I'd appreciate a look at patch 5 especially, since the accounting there is
unusual no matter whether locked_vm or pinned_vm is used.

On powerpc, this was cross-compile tested only.

[0] http://lkml.kernel.org/r/20181105165558.11698-8-daniel.m.jordan@oracle.com
[1] http://lkml.kernel.org/r/20190206175920.31082-1-dave@stgolabs.net
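
For reference, here is a rough sketch of the shape of the conversion.  It is
illustrative only (invented helper names, simplified error handling), not code
taken from the patches themselves:

    #include <linux/atomic.h>
    #include <linux/capability.h>
    #include <linux/mm.h>
    #include <linux/sched/signal.h>

    /* Before: pinned pages charged to locked_vm, which requires mmap_sem. */
    static int sketch_account_locked_vm(struct mm_struct *mm, unsigned long npages)
    {
            unsigned long limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
            int ret = 0;

            down_write(&mm->mmap_sem);
            if (mm->locked_vm + npages > limit && !capable(CAP_IPC_LOCK))
                    ret = -ENOMEM;
            else
                    mm->locked_vm += npages;
            up_write(&mm->mmap_sem);
            return ret;
    }

    /* After: pinned pages charged to pinned_vm, an atomic64_t, no mmap_sem. */
    static int sketch_account_pinned_vm(struct mm_struct *mm, unsigned long npages)
    {
            unsigned long limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;

            if (atomic64_add_return(npages, &mm->pinned_vm) > limit &&
                !capable(CAP_IPC_LOCK)) {
                    atomic64_sub(npages, &mm->pinned_vm);
                    return -ENOMEM;
            }
            return 0;
    }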

Daniel Jordan (5):
  vfio/type1: use pinned_vm instead of locked_vm to account pinned pages
  vfio/spapr_tce: use pinned_vm instead of locked_vm to account pinned
    pages
  fpga/dfl/afu: use pinned_vm instead of locked_vm to account pinned
    pages
  powerpc/mmu: use pinned_vm instead of locked_vm to account pinned
    pages
  kvm/book3s: use pinned_vm instead of locked_vm to account pinned pages

 Documentation/vfio.txt              |  6 +--
 arch/powerpc/kvm/book3s_64_vio.c    | 35 +++++++---------
 arch/powerpc/mm/mmu_context_iommu.c | 43 ++++++++++---------
 drivers/fpga/dfl-afu-dma-region.c   | 50 +++++++++++-----------
 drivers/vfio/vfio_iommu_spapr_tce.c | 64 ++++++++++++++---------------
 drivers/vfio/vfio_iommu_type1.c     | 31 ++++++--------
 6 files changed, 104 insertions(+), 125 deletions(-)

Comments

Jason Gunthorpe Feb. 11, 2019, 10:54 p.m. UTC | #1
On Mon, Feb 11, 2019 at 05:44:32PM -0500, Daniel Jordan wrote:
> Hi,
> 
> This series converts users that account pinned pages with locked_vm to
> account with pinned_vm instead, pinned_vm being the correct counter to
> use.  It's based on a similar patch I posted recently[0].
> 
> The patches are based on rdma/for-next to build on Davidlohr Bueso's
> recent conversion of pinned_vm to an atomic64_t[1].  Seems to make some
> sense for these to be routed the same way, despite lack of rdma content?

Oy.. I'd be willing to accumulate a branch with acks to send to Linus
*separately* from RDMA, but this is very abnormal.

Better to wait a few weeks for -rc1 and send patches through the
subsystem trees.

> All five of these places, and probably some of Davidlohr's conversions,
> probably want to be collapsed into a common helper in the core mm for
> accounting pinned pages.  I tried, and there are several details that
> likely need discussion, so this can be done as a follow-on.

I've wondered the same..

Jason
Daniel Jordan Feb. 11, 2019, 11:15 p.m. UTC | #2
On Mon, Feb 11, 2019 at 03:54:47PM -0700, Jason Gunthorpe wrote:
> On Mon, Feb 11, 2019 at 05:44:32PM -0500, Daniel Jordan wrote:
> > Hi,
> > 
> > This series converts users that account pinned pages with locked_vm to
> > account with pinned_vm instead, pinned_vm being the correct counter to
> > use.  It's based on a similar patch I posted recently[0].
> > 
> > The patches are based on rdma/for-next to build on Davidlohr Bueso's
> > recent conversion of pinned_vm to an atomic64_t[1].  Seems to make some
> > sense for these to be routed the same way, despite lack of rdma content?
> 
> Oy.. I'd be willing to accumulate a branch with acks to send to Linus
> *separately* from RDMA to Linus, but this is very abnormal.
> 
> Better to wait a few weeks for -rc1 and send patches through the
> subsystem trees.

Ok, I can do that.  It did seem strange, so I made it a question...
Ira Weiny Feb. 14, 2019, 1:53 a.m. UTC | #3
On Mon, Feb 11, 2019 at 03:54:47PM -0700, Jason Gunthorpe wrote:
> On Mon, Feb 11, 2019 at 05:44:32PM -0500, Daniel Jordan wrote:
> 
> > All five of these places, and probably some of Davidlohr's conversions,
> > probably want to be collapsed into a common helper in the core mm for
> > accounting pinned pages.  I tried, and there are several details that
> > likely need discussion, so this can be done as a follow-on.
> 
> I've wondered the same..

I'm really thinking this would be a nice way to ensure it gets cleaned up and
does not happen again.

Also, by moving it to the core we could better manage any user visible changes.

From a high level, pinned is a subset of locked, so it seems like we need 2
sets of helpers:

try_increment_locked_vm(...)
decrement_locked_vm(...)

try_increment_pinned_vm(...)
decrement_pinned_vm(...)

Where try_increment_pinned_vm() also increments locked_vm...  Of course this
may end up reverting the improvement of Davidlohr Bueso's atomic work...  :-(

Furthermore, it would seem better (although I don't know if it's at all
possible) if this were accounted for in core calls which tracked them based on
how the pages are being used, so that drivers can't call
try_increment_locked_vm() and then pin the pages...  Thus getting the
accounting wrong vs. what actually happened.

And then in the end we can go back to locked_vm being the value checked against
RLIMIT_MEMLOCK.
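
Roughly, the pinned_vm pair might look like the below.  This is purely a
sketch (nothing like it exists in-tree yet, and the bodies are my assumption),
and taking mmap_sem again is exactly the atomic64_t regression mentioned above:

    static int try_increment_pinned_vm(struct mm_struct *mm, unsigned long npages)
    {
            unsigned long limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
            int ret = 0;

            /*
             * In this scheme a pin is also a lock: charge both counters
             * and check RLIMIT_MEMLOCK against locked_vm.
             */
            down_write(&mm->mmap_sem);
            if (mm->locked_vm + npages > limit && !capable(CAP_IPC_LOCK)) {
                    ret = -ENOMEM;
            } else {
                    mm->locked_vm += npages;
                    atomic64_add(npages, &mm->pinned_vm);
            }
            up_write(&mm->mmap_sem);
            return ret;
    }

    static void decrement_pinned_vm(struct mm_struct *mm, unsigned long npages)
    {
            down_write(&mm->mmap_sem);
            atomic64_sub(npages, &mm->pinned_vm);
            mm->locked_vm -= npages;
            up_write(&mm->mmap_sem);
    }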

Ira
Jason Gunthorpe Feb. 14, 2019, 6 a.m. UTC | #4
On Wed, Feb 13, 2019 at 05:53:14PM -0800, Ira Weiny wrote:
> On Mon, Feb 11, 2019 at 03:54:47PM -0700, Jason Gunthorpe wrote:
> > On Mon, Feb 11, 2019 at 05:44:32PM -0500, Daniel Jordan wrote:
> > 
> > > All five of these places, and probably some of Davidlohr's conversions,
> > > probably want to be collapsed into a common helper in the core mm for
> > > accounting pinned pages.  I tried, and there are several details that
> > > likely need discussion, so this can be done as a follow-on.
> > 
> > I've wondered the same..
> 
> I'm really thinking this would be a nice way to ensure it gets cleaned up and
> does not happen again.
> 
> Also, by moving it to the core we could better manage any user visible changes.
> 
> From a high level, pinned is a subset of locked so it seems like we need a 2
> sets of helpers.
> 
> try_increment_locked_vm(...)
> decrement_locked_vm(...)
> 
> try_increment_pinned_vm(...)
> decrement_pinned_vm(...)
> 
> Where try_increment_pinned_vm() also increments locked_vm...  Of course this
> may end up reverting the improvement of Davidlohr  Bueso's atomic work...  :-(
> 
> Furthermore it would seem better (although I don't know if at all possible) if
> this were accounted for in core calls which tracked them based on how the pages
> are being used so that drivers can't call try_increment_locked_vm() and then
> pin the pages...  Thus getting the account wrong vs what actually happened.
> 
> And then in the end we can go back to locked_vm being the value checked against
> RLIMIT_MEMLOCK.

Someone would need to understand the bug that was fixed by splitting
them. 

I think it had to do with double accounting pinned and mlocked pages
and thus delivering a lower than expected limit to userspace.

vfio has this bug, RDMA does not. RDMA has a bug where it can
overallocate locked memory, vfio doesn't.

Really unclear how to fix this. The pinned/locked split with two
buckets may be the right way.

Jason
Ira Weiny Feb. 14, 2019, 7:33 p.m. UTC | #5
On Wed, Feb 13, 2019 at 11:00:06PM -0700, Jason Gunthorpe wrote:
> On Wed, Feb 13, 2019 at 05:53:14PM -0800, Ira Weiny wrote:
> > On Mon, Feb 11, 2019 at 03:54:47PM -0700, Jason Gunthorpe wrote:
> > > On Mon, Feb 11, 2019 at 05:44:32PM -0500, Daniel Jordan wrote:
> > > 
> > > > All five of these places, and probably some of Davidlohr's conversions,
> > > > probably want to be collapsed into a common helper in the core mm for
> > > > accounting pinned pages.  I tried, and there are several details that
> > > > likely need discussion, so this can be done as a follow-on.
> > > 
> > > I've wondered the same..
> > 
> > I'm really thinking this would be a nice way to ensure it gets cleaned up and
> > does not happen again.
> > 
> > Also, by moving it to the core we could better manage any user visible changes.
> > 
> > From a high level, pinned is a subset of locked so it seems like we need a 2
> > sets of helpers.
> > 
> > try_increment_locked_vm(...)
> > decrement_locked_vm(...)
> > 
> > try_increment_pinned_vm(...)
> > decrement_pinned_vm(...)
> > 
> > Where try_increment_pinned_vm() also increments locked_vm...  Of course this
> > may end up reverting the improvement of Davidlohr  Bueso's atomic work...  :-(
> > 
> > Furthermore it would seem better (although I don't know if at all possible) if
> > this were accounted for in core calls which tracked them based on how the pages
> > are being used so that drivers can't call try_increment_locked_vm() and then
> > pin the pages...  Thus getting the account wrong vs what actually happened.
> > 
> > And then in the end we can go back to locked_vm being the value checked against
> > RLIMIT_MEMLOCK.
> 
> Someone would need to understand the bug that was fixed by splitting
> them. 
>

My suggestion above assumes that splitting them is required/correct.  To be
fair I've not dug into whether this is true or not, but I trust Christoph.

What I have found is this commit:

bc3e53f682d9 mm: distinguish between mlocked and pinned pages

I think that commit introduced the bug (for IB), which at the time may have
been "ok" because many IB users back then were HPC/MPI users, and I don't
think MPI does a lot of _separate_ mlock operations, so the locked_vm count
was probably negligible.  Alternatively, the clusters I've worked on in the
past set RLIMIT_MEMLOCK to 'unlimited' on their compute nodes while running
MPI applications...  :-/

I think what Christoph did was probably ok for the internal tracking, but
shouldn't we have had something which summed the two for RLIMIT_MEMLOCK
checking at that time to be 100% correct?  Christoph, do you remember why you
did not do that?

[1] http://lkml.kernel.org/r/20130524140114.GK23650@twins.programming.kicks-ass.net

> 
> I think it had to do with double accounting pinned and mlocked pages
> and thus delivering a lower than expected limit to userspace.
> 
> vfio has this bug, RDMA does not. RDMA has a bug where it can
> overallocate locked memory, vfio doesn't.

Wouldn't vfio also be able to overallocate if the user had RDMA pinned pages?

I think the problem is that if the user calls mlock on a large range, then both
vfio and RDMA could potentially overallocate even with this fix.  This was in
your initial email to Daniel, I think...  And it was Alex's concern.

> 
> Really unclear how to fix this. The pinned/locked split with two
> buckets may be the right way.

Are you suggesting that we have 2 user limits?

Ira

> 
> Jason
Jason Gunthorpe Feb. 14, 2019, 8:12 p.m. UTC | #6
On Thu, Feb 14, 2019 at 11:33:53AM -0800, Ira Weiny wrote:

> > I think it had to do with double accounting pinned and mlocked pages
> > and thus delivering a lower than expected limit to userspace.
> > 
> > vfio has this bug, RDMA does not. RDMA has a bug where it can
> > overallocate locked memory, vfio doesn't.
> 
> Wouldn't vfio also be able to overallocate if the user had RDMA pinned pages?

Yes
 
> I think the problem is that if the user calls mlock on a large range then both
> vfio and RDMA could potentially overallocate even with this fix.  This was your
> initial email to Daniel, I think...  And Alex's concern.

Here are the possibilities:
- mlock and pin on the same pages - RDMA respects the limit, VFIO halves it.
- mlock and pin on different pages - RDMA doubles the limit, VFIO
  respects it.
- VFIO and RDMA in the same process - the limit is halved or doubled, depending.

IMHO we should make VFIO & RDMA the same, and then decide what to do
about case #2.
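
To put arbitrary numbers on it (purely an illustration): say RLIMIT_MEMLOCK is
1 GiB.  mlock() 600 MiB and then pin those same 600 MiB through VFIO, and
locked_vm is charged twice, 1200 MiB, so the pin fails even though only
600 MiB is actually unreclaimable.  Do the same through RDMA and locked_vm and
pinned_vm each sit at 600 MiB, so it succeeds.  Make the mlocked and pinned
ranges disjoint instead, and RDMA lets the process hold 1 GiB locked plus
1 GiB pinned, while VFIO's single locked_vm counter catches the 2 GiB total.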

> > Really unclear how to fix this. The pinned/locked split with two
> > buckets may be the right way.
> 
> Are you suggesting that we have 2 user limits?

This is what RDMA has done since CL's patch.

It is very hard to fix as you need to track how many pages are mlocked
*AND* pinned.

Jason
Ira Weiny Feb. 14, 2019, 9:46 p.m. UTC | #7
On Thu, Feb 14, 2019 at 01:12:31PM -0700, Jason Gunthorpe wrote:
> On Thu, Feb 14, 2019 at 11:33:53AM -0800, Ira Weiny wrote:
> 
> > > I think it had to do with double accounting pinned and mlocked pages
> > > and thus delivering a lower than expected limit to userspace.
> > > 
> > > vfio has this bug, RDMA does not. RDMA has a bug where it can
> > > overallocate locked memory, vfio doesn't.
> > 
> > Wouldn't vfio also be able to overallocate if the user had RDMA pinned pages?
> 
> Yes
>  
> > I think the problem is that if the user calls mlock on a large range then both
> > vfio and RDMA could potentially overallocate even with this fix.  This was your
> > initial email to Daniel, I think...  And Alex's concern.
> 
> Here are the possibilities
> - mlock and pin on the same pages - RDMA respects the limit, VFIO halfs it.
> - mlock and pin on different pages - RDMA doubles the limit, VFIO
>   respects it
> - VFIO and RDMA in the same process, the limit is halfed or doubled, depending.
> 
> IHMO we should make VFIO & RDMA the same, and then decide what to do
> about case #2.

I'm not against that.  Sorry if I came across that way.  For this series I
agree we should make it consistent.

> 
> > > Really unclear how to fix this. The pinned/locked split with two
> > > buckets may be the right way.
> > 
> > Are you suggesting that we have 2 user limits?
> 
> This is what RDMA has done since CL's patch.

I don't understand?  What is the other _user_ limit (other than
RLIMIT_MEMLOCK)?

> 
> It is very hard to fix as you need to track how many pages are mlocked
> *AND* pinned.

Understood. :-/

Ira

> 
> Jason
Jason Gunthorpe Feb. 14, 2019, 10:16 p.m. UTC | #8
On Thu, Feb 14, 2019 at 01:46:51PM -0800, Ira Weiny wrote:

> > > > Really unclear how to fix this. The pinned/locked split with two
> > > > buckets may be the right way.
> > > 
> > > Are you suggesting that we have 2 user limits?
> > 
> > This is what RDMA has done since CL's patch.
> 
> I don't understand?  What is the other _user_ limit (other than
> RLIMIT_MEMLOCK)?

With today's implementation RLIMIT_MEMLOCK covers two user limits: the total
number of pinned pages and the total number of mlocked pages. The two are
different buckets and are not summed.
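
Paraphrasing (not the exact kernel code), the two checks look something like
this today, so neither path ever sees the other's total:

    #include <linux/atomic.h>
    #include <linux/capability.h>
    #include <linux/mm.h>
    #include <linux/sched/signal.h>

    /* mlock() path: the limit is applied to locked_vm only. */
    static bool sketch_mlock_ok(struct mm_struct *mm, unsigned long npages)
    {
            unsigned long lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;

            return mm->locked_vm + npages <= lock_limit ||
                   capable(CAP_IPC_LOCK);
    }

    /* RDMA registration path: the limit is applied to pinned_vm only. */
    static bool sketch_pin_ok(struct mm_struct *mm, unsigned long npages)
    {
            unsigned long lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;

            return atomic64_read(&mm->pinned_vm) + npages <= lock_limit ||
                   capable(CAP_IPC_LOCK);
    }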

Jason
Christoph Lameter (Ampere) Feb. 15, 2019, 3:26 p.m. UTC | #9
On Thu, 14 Feb 2019, Jason Gunthorpe wrote:

> On Thu, Feb 14, 2019 at 01:46:51PM -0800, Ira Weiny wrote:
>
> > > > > Really unclear how to fix this. The pinned/locked split with two
> > > > > buckets may be the right way.
> > > >
> > > > Are you suggesting that we have 2 user limits?
> > >
> > > This is what RDMA has done since CL's patch.
> >
> > I don't understand?  What is the other _user_ limit (other than
> > RLIMIT_MEMLOCK)?
>
> With todays implementation RLIMIT_MEMLOCK covers two user limits,
> total number of pinned pages and total number of mlocked pages. The
> two are different buckets and not summed.

Applications were failing at some point because the two were effectively
summed. If you mlocked/pinned a dataset of more than half the memory of
a system, things would get really weird.

Also, there is the possibility of even more duplication because pages can
be pinned by multiple kernel subsystems, so you could end up with more than
a doubling of the number.

The sane thing was to account them separately so that mlocking and
pinning worked without apps failing, and then wait for another genius
to find out how to improve the situation by getting the pinned page mess
under control.

It is not even advisable to check pinned pages against any limit because
pages can be pinned by multiple subsystems.

The main problem here is that we only have a refcount to indicate pinning
and no way to clearly distinguish long term from short pins. In order to
really fix this issue we would need to have a list of subsystems that have
taken long term pins on a page. But doing so would waste a lot of memory
and cause a significant performance regression.

And the discussions here seem to be meandering around these issues.
Nothing really that convinces me that we have a clean solution at hand.