Message ID: 20130606124351.GZ27176@twins.programming.kicks-ass.net (mailing list archive)
State: Not Applicable, archived
On Thu, 6 Jun 2013, Peter Zijlstra wrote:

> Since RLIMIT_MEMLOCK is very clearly a limit on the amount of pages the
> process can 'lock' into memory it should very much include pinned pages
> as well as mlock()ed pages. Neither can be paged.

So we thought that this is the sum of the pages that a process has
mlocked. Initiated by the process and/or environment explicitly. A user
space initiated action.

> Since nobody had anything constructive to say about the VM_PINNED
> approach and the IB code hurts my head too much to make it work I
> propose we revert said patch.

I said that the use of a PIN page flag would allow correct accounting if
one wanted to interpret the limit the way you do.

> Once again the rationale; MLOCK(2) is part of the POSIX Realtime
> Extension (1003.1b-1993/1003.1i-1995). It states that the specified part
> of the user address space should stay memory resident until either
> program exit or a matching munlock() call.
>
> This definition basically excludes major faults from happening on the
> pages -- a major fault being one where IO needs to happen to obtain the
> page content; the direct implication being that page content must remain
> in memory.

Exactly, that is the definition.

> Linux has taken this literally and made mlock()ed pages subject to page
> migration (albeit only for the explicit move_pages() syscall; but it
> would very much like to make them subject to implicit page migration for
> the purpose of compaction etc.).

Page migration is not a page fault? The ability to move a process
completely (including its mlocked segments) is important for the manual
migration of process memory. That is what page migration was made for. If
mlocked pages are treated as pinned pages then the complete process can
no longer be moved from node to node.
> This view disregards the intention of the spec; since mlock() is part of
> the realtime spec the intention is very much that the user address range
> generate no faults; neither minor nor major -- any delay is
> unacceptable.

Where does it say that no faults are generated? Don't we generate COW
faults on mlocked ranges?

> This leaves the RT people unhappy -- therefore _if_ we continue with
> this Linux specific interpretation of mlock() we must introduce new
> syscalls that implement the intended mlock() semantics.

Intended means Peter's semantics?

> It was found that there are useful purposes for this weaker mlock(), a
> rationale to indeed have two sets of syscalls. The weaker mlock() can be
> used in the context of security -- where we avoid sensitive data being
> written to disk, and in the context of userspace daemons that are part
> of the IO path -- which would otherwise form IO deadlocks.

Migratable mlocked pages enable complete process migration between nodes
of a NUMA system for HPC workloads.

> The proposed second set of primitives would be mpin() and munpin() and
> would implement the intended mlock() semantics.

I agree that we need mpin and munpin. But they should not be called mlock
semantics.

> Such pages would not be migratable in any way (a possible
> implementation would be to 'pin' the pages using an extra refcount on
> the page frame). From the above we can see that any mpin()ed page is
> also an mlock()ed page, since mpin() will disallow any fault, and thus
> will also disallow major faults.

That cannot be so since mlocked pages need to be migratable.

> While we still lack the formal mpin() and munpin() syscalls there are a
> number of sites that have similar 'side effects' and result in user
> controlled 'pinning' of pages. Namely IB and perf.

Right, that's why we need this.

> For the purpose of RLIMIT_MEMLOCK we must use intent only as it is not
> part of the formal spec.
> The only useful thing is to limit the amount of
> pages a user can exempt from paging. This would therefore include all
> pages either mlock()ed or mpin()ed.

RLIMIT_MEMLOCK is a limit on the pages that a process has mlocked into
memory. Pinning is not initiated by user space but by the kernel. Either
temporarily (page count increases are used all over the kernel for this)
or for a longer time frame (IB and perf, and likely more drivers that we
have not found yet).

>
> Back to the patch; a resource limit must have a resource counter to
> enact the limit upon. Before the patch this was mm_struct::locked_vm.
> After the patch there is no such thing left.

The limit was not checked correctly before the patch since pinned pages
were accounted as mlocked.

> I state that since mlockall() disables/invalidates RLIMIT_MEMLOCK the
> actual resource counter value is irrelevant, and thus the reported
> problem is a non-problem.

Where does it disable RLIMIT_MEMLOCK?

> However, it would still be possible to observe weirdness in the very
> unlikely event that a user would indeed call mlock() upon an address
> range obtained from IB/perf. In this case he would be unduly constrained
> and find his effective RLIMIT_MEMLOCK limit halved (at worst).

This is weird for other reasons as well, since we are using two different
methods: mlock() leads to the page being marked PG_mlocked and put on a
special LRU list, whereas pinning is simply an increase in the refcount.
It's going to be difficult to keep the accounting straight because a page
can have both.

> I've yet to hear a coherent objection to the above. Christoph is always
> quick to yell: 'but it fixes a double accounting issue' but is
> completely deaf to the fact that he changed user visible semantics
> without mention and regard.

The semantics we agreed upon for mlock were that mlocked pages are
migratable. Pinned pages cannot be migrated and therefore are not mlocked
pages. It would be best to have two different mechanisms for this.
Note that the pinning is always done by the kernel and/or a device driver
related to some other activity, whereas mlocking is initiated by
userspace.

We do not need a new API. If you want to avoid the moving of mlocked
pages then disable the mechanisms that need to move processes around.
Having the ability in general to move processes is a degree of control
that we want. Pinning could be restricted to where it is required by the
hardware.

We could also add a process flag that exempts a certain process from page
migration, which may be the result of NUMA scheduler actions, explicit
requests from the user to migrate a process, defrag or compaction.

Excessive amounts of pinned pages severely limit the ability of the
kernel to manage its memory, which may cause instabilities. mlocked pages
are better because the kernel still has some options.

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Jun 06, 2013 at 06:46:50PM +0000, Christoph Lameter wrote:
> On Thu, 6 Jun 2013, Peter Zijlstra wrote:
>
> > Since RLIMIT_MEMLOCK is very clearly a limit on the amount of pages the
> > process can 'lock' into memory it should very much include pinned pages
> > as well as mlock()ed pages. Neither can be paged.
>
> So we thought that this is the sum of the pages that a process has
> mlocked. Initiated by the process and/or environment explicitly. A user
> space initiated action.

Which "we"? Also, it remains a fact that your changelog didn't mention
this change in semantics at all. Nor did you CC all affected parties.

However you twist this, your patch leaves an inconsistent mess. If you
really think they're two different things then you should have
introduced a second RLIMIT_MEMPIN to go along with your counter.

I'll argue against such a thing; for I think that limiting the total
amount of pages a user can exempt from paging is the far more
useful/natural thing to measure/limit.

> > Since nobody had anything constructive to say about the VM_PINNED
> > approach and the IB code hurts my head too much to make it work I
> > propose we revert said patch.
>
> I said that the use of a PIN page flag would allow correct accounting if
> one wanted to interpret the limit the way you do.

You failed to explain how that would help any. With a pin page flag you
still need to find the mm to unaccount crap from. Also, all user
controlled address space ops operate on vmas. We had VM_LOCKED far
before we had the lock page flag. And you cannot replace all VM_LOCKED
utility with the page flag either.

> > Once again the rationale; MLOCK(2) is part of the POSIX Realtime
> > Extension (1003.1b-1993/1003.1i-1995). It states that the specified
> > part of the user address space should stay memory resident until
> > either program exit or a matching munlock() call.
> >
> > This definition basically excludes major faults from happening on the
> > pages -- a major fault being one where IO needs to happen to obtain the
> > page content; the direct implication being that page content must remain
> > in memory.
>
> Exactly, that is the definition.
>
> > Linux has taken this literally and made mlock()ed pages subject to page
> > migration (albeit only for the explicit move_pages() syscall; but it
> > would very much like to make them subject to implicit page migration for
> > the purpose of compaction etc.).
>
> Page migration is not a page fault?

It introduces faults; what happens when a process hits the migration
pte? It gets a random delay and eventually services a minor fault to the
new page. At which point the saw will have cut your finger off (going
with the most popular RT application ever -- that of a bandsaw and a
laser beam).

> The ability to move a process
> completely (including its mlocked segments) is important for the manual
> migration of process memory. That is what page migration was made for. If
> mlocked pages are treated as pinned pages then the complete process can
> no longer be moved from node to node.
>
> > This view disregards the intention of the spec; since mlock() is part of
> > the realtime spec the intention is very much that the user address range
> > generate no faults; neither minor nor major -- any delay is
> > unacceptable.
>
> Where does it say that no faults are generated? Don't we generate COW
> faults on mlocked ranges?

That's under user control. If the user uses fork() the user can avoid
those faults by pre-faulting the pages.

> > This leaves the RT people unhappy -- therefore _if_ we continue with
> > this Linux specific interpretation of mlock() we must introduce new
> > syscalls that implement the intended mlock() semantics.
>
> Intended means Peter's semantics?

No, I don't actually write RT applications.
But I've had plenty of arguments with RT people when I explained to them
what our mlock() actually does vs what they expected it to do. They're
not happy. Aside from that, you HPC/HFT minimal latency lot should very
well appreciate the minimal interference they actually expect.

> > It was found that there are useful purposes for this weaker mlock(), a
> > rationale to indeed have two sets of syscalls. The weaker mlock() can be
> > used in the context of security -- where we avoid sensitive data being
> > written to disk, and in the context of userspace daemons that are part
> > of the IO path -- which would otherwise form IO deadlocks.
>
> Migratable mlocked pages enable complete process migration between nodes
> of a NUMA system for HPC workloads.

This might well be; and I'm not arguing we remove this. I'm merely
stating that it doesn't make everybody happy. Also, what purpose do HPC
type applications have for mlock()?

> > The proposed second set of primitives would be mpin() and munpin() and
> > would implement the intended mlock() semantics.
>
> I agree that we need mpin and munpin. But they should not be called mlock
> semantics.

Here we must disagree I fear; given that mlock() is of RT origin and RT
people very much want/expect mlock() to do what our proposed mpin() will
do.

> > Such pages would not be migratable in any way (a possible
> > implementation would be to 'pin' the pages using an extra refcount on
> > the page frame). From the above we can see that any mpin()ed page is
> > also an mlock()ed page, since mpin() will disallow any fault, and thus
> > will also disallow major faults.
>
> That cannot be so since mlocked pages need to be migratable.

I'm talking about the proposed mpin() stuff.

> > While we still lack the formal mpin() and munpin() syscalls there are a
> > number of sites that have similar 'side effects' and result in user
> > controlled 'pinning' of pages. Namely IB and perf.
>
> Right, that's why we need this.
So I proposed most of the machinery that would be required to actually
implement the syscalls. Except that the IB code stumped me. In
particular I cannot easily find the userspace address to unpin in the
ipath/qib release paths. Once we have that we can trivially implement
the syscalls.

> > For the purpose of RLIMIT_MEMLOCK we must use intent only as it is not
> > part of the formal spec. The only useful thing is to limit the amount of
> > pages a user can exempt from paging. This would therefore include all
> > pages either mlock()ed or mpin()ed.
>
> RLIMIT_MEMLOCK is a limit on the pages that a process has mlocked into
> memory.

See my argument above; I don't think userspace is served well with two
RLIMITs.

> Pinning is not initiated by user space but by the kernel. Either
> temporarily (page count increases are used all over the kernel for this)
> or for a longer time frame (IB and perf, and likely more drivers that we
> have not found yet).

Here I disagree. I'll argue that all pins that require userspace action
to go away are userspace controlled pins. This makes IB/perf user
controlled pins and they should thus be treated the same as
mpin()/munpin() calls.

> >
> > Back to the patch; a resource limit must have a resource counter to
> > enact the limit upon. Before the patch this was mm_struct::locked_vm.
> > After the patch there is no such thing left.
>
> The limit was not checked correctly before the patch since pinned pages
> were accounted as mlocked.

This goes back to what you want the limit to mean; I think a single
limit counting all pages exempt from paging is the far more useful
limit. Again, your changelog was completely devoid of any rlimit
discussion.

> > I state that since mlockall() disables/invalidates RLIMIT_MEMLOCK the
> > actual resource counter value is irrelevant, and thus the reported
> > problem is a non-problem.
>
> Where does it disable RLIMIT_MEMLOCK?

The only way to get an MCL_FUTURE is through CAP_IPC_LOCK.
Practically, most MCL_CURRENT requests are also larger than
RLIMIT_MEMLOCK and thus also require CAP_IPC_LOCK. Once you have
CAP_IPC_LOCK, RLIMIT_MEMLOCK becomes irrelevant.

> > However, it would still be possible to observe weirdness in the very
> > unlikely event that a user would indeed call mlock() upon an address
> > range obtained from IB/perf. In this case he would be unduly constrained
> > and find his effective RLIMIT_MEMLOCK limit halved (at worst).
>
> This is weird for other reasons as well, since we are using two different
> methods: mlock() leads to the page being marked PG_mlocked and put on a
> special LRU list, whereas pinning is simply an increase in the refcount.
> It's going to be difficult to keep the accounting straight because a
> page can have both.

The way the kernel implements these semantics is irrelevant.

> > I've yet to hear a coherent objection to the above. Christoph is always
> > quick to yell: 'but it fixes a double accounting issue' but is
> > completely deaf to the fact that he changed user visible semantics
> > without mention and regard.
>
> The semantics we agreed upon for mlock were that mlocked pages are
> migratable. Pinned pages cannot be migrated and therefore are not
> mlocked pages. It would be best to have two different mechanisms for
> this.

Best for whom? How is having two RLIMIT settings better than having one?
For me, being able to migrate a page is a consequence of allowing minor
faults, not a fundamental property.

> Note that the
> pinning is always done by the kernel and/or device driver related to some
> other activity whereas mlocking is initiated by userspace.

See my earlier argument. If unpinning requires userspace action, it's a
userspace controlled pin. Also, for perf the setup is very much a
userspace action too -- an mmap() call -- and I expect the same is true
for IB, although it might use different syscalls.

> We do not need a new API.
This is in direct contradiction with your earlier statement where you
say: "I agree that we need mpin and munpin." Please make up your mind.

> If you want to avoid the moving of mlocked pages then disable the
> mechanisms that need to move processes around.

This is not really a feasible option anymore; page migration is creeping
all over the place. Also, who is to say people don't want to run
something with huge pages alongside their RT proglet.

> Having the ability in
> general to move processes is a degree of control that we want. Pinning
> could be restricted to where it is required by the hardware.

Or software; you cannot blindly dismiss the realtime guarantees the
software needs to provide.

> We could also add a process flag that exempts a certain process from page
> migration, which may be the result of NUMA scheduler actions, explicit
> requests from the user to migrate a process, defrag or compaction.

This too is a new API, and one I think is much less precise than mpin().
You wouldn't want an entire process to be exempt from paging. Only the
code and data involved with the realtime part of the process need stay
put.

Furthermore, mpin() could simply migrate the pages into as few unmovable
page blocks as possible to minimise the impact on the rest of the VM.
Allocating entire processes as unmovable and migrating any DSO they
touch into unmovable is a far more onerous endeavour.

With the pre-compaction of pinned memory, I don't see the VM impact of
pinned pages being much bigger than that of mlocked pages. The main
issue will be limiting the amount of memory exempt from paging, not
fragmentation per se. Yes, you can get some unmovable block
fragmentation on unpin, but if it's a common thing a new pin will be
able to fill those blocks again.

> Excessive amounts of pinned pages severely limit the ability of the
> kernel to manage its memory, which may cause instabilities. mlocked pages
> are better because the kernel still has some options.
I'm not arguing which are better; I've even conceded we need mpin() and
can keep the current mlock() semantics. I'm arguing for what we _need_.
On Fri, 7 Jun 2013, Peter Zijlstra wrote:

> However you twist this, your patch leaves an inconsistent mess. If you
> really think they're two different things then you should have
> introduced a second RLIMIT_MEMPIN to go along with your counter.

Well, continuing to repeat myself: I worked based on agreed upon
characteristics of mlocked pages. The patch was there to address
brokenness in the mlock accounting because someone naively assumed that
pinning = mlock.

> I'll argue against such a thing; for I think that limiting the total
> amount of pages a user can exempt from paging is the far more
> useful/natural thing to measure/limit.

Pinned pages are exempted by the kernel. A device driver or some other
kernel process (reclaim, page migration, IO etc.) increases the page
count. There is currently no consistent accounting for pinned pages. The
pinned_vm counter was introduced to allow the largest pinners to track
what they did.

> > I said that the use of a PIN page flag would allow correct accounting if
> > one wanted to interpret the limit the way you do.
>
> You failed to explain how that would help any. With a pin page flag you
> still need to find the mm to unaccount crap from. Also, all user
> controlled address space ops operate on vmas.

Pinning is kernel controlled...

> > Page migration is not a page fault?
>
> It introduces faults; what happens when a process hits the migration
> pte? It gets a random delay and eventually services a minor fault to the
> new page.

Ok, but this is similar to reclaim and other such things that unmap
pages.

> At which point the saw will have cut your finger off (going with the
> most popular RT application ever -- that of a bandsaw and a laser beam).

I am pretty confused by your newer notion of RT. RT was about high
latency deterministic behavior, I thought. RT was basically an abused
marketing term and was referring to the bloating of the kernel with all
sorts of fair stuff that slows us down.
What happened to make you work on low latency stuff? There is some shift
that you still need to go through to make that transition. Yes, you
would want to avoid reclaim and all sorts of other stuff for low
latency. So you disable auto NUMA, defrag etc. to avoid these things.

> > > This leaves the RT people unhappy -- therefore _if_ we continue with
> > > this Linux specific interpretation of mlock() we must introduce new
> > > syscalls that implement the intended mlock() semantics.
> >
> > Intended means Peter's semantics?
>
> No, I don't actually write RT applications. But I've had plenty of
> arguments with RT people when I explained to them what our mlock()
> actually does vs what they expected it to do.

Ok. Guess this is all new to you at this point. I am happy to see that
you are willing to abandon your evil ways (although under pressure from
your users) and are willing to put the low latency people in the RT camp
now.

> They're not happy. Aside from that, you HPC/HFT minimal latency lot
> should very well appreciate the minimal interference they actually
> expect.

Sure we do, and we know how to work around the "fair scheduler" and
other stuff. But you are breaking the basics of how we do things with
your conflation of pinning and mlocking. We do not migrate, and do not
allow defragmentation or reclaim, when running low latency applications.
These are non-issues.

> This might well be; and I'm not arguing we remove this. I'm merely
> stating that it doesn't make everybody happy. Also, what purpose do HPC
> type applications have for mlock()?

HPC wants to keep pages in memory to avoid eviction. HPC apps are not as
sensitive to faults as low latency apps are. Minor faults have
traditionally been tolerated there. The lower you get in terms of the
latencies required, the more difficult OS control becomes.
> Here we must disagree I fear; given that mlock() is of RT origin and RT
> people very much want/expect mlock() to do what our proposed mpin() will
> do.

RT is a dirty word for me given the fairness and bloat issues. Not sure
what you mean with that. mlock is a means to keep data in memory and not
a magical wand that avoids all OS handling of the page.

> > That cannot be so since mlocked pages need to be migratable.
>
> I'm talking about the proposed mpin() stuff.

Could you write that up in detail? I am not sure how this could work at
this point.

> So I proposed most of the machinery that would be required to actually
> implement the syscalls. Except that the IB code stumped me. In
> particular I cannot easily find the userspace address to unpin in the
> ipath/qib release paths.
>
> Once we have that we can trivially implement the syscalls.

Why would you need syscalls? Pinning is driver/kernel subsystem
initiated and therefore the driver can do the pin/unpin calls.

> > Pinning is not initiated by user space but by the kernel. Either
> > temporarily (page count increases are used all over the kernel for this)
> > or for a longer time frame (IB and perf, and likely more drivers that
> > we have not found yet).
>
> Here I disagree. I'll argue that all pins that require userspace action
> to go away are userspace controlled pins. This makes IB/perf user
> controlled pins and they should thus be treated the same as
> mpin()/munpin() calls.

Both IB and perf are kernel subsystems that pin as a side effect of
another syscall.

> This goes back to what you want the limit to mean; I think a single
> limit counting all pages exempt from paging is the far more useful
> limit.

Then you first need to show that the scheme accounts for *all* pages
exempt from paging. This includes dirty pages and various pages of other
subsystems that have an increased refcount in order to keep these pages
where they are.
We need a couple of passes through the kernel to find these locations
and build up counters for them. Not all of these pages will be
associated with processes.

> The only way to get an MCL_FUTURE is through CAP_IPC_LOCK. Practically,
> most MCL_CURRENT requests are also larger than RLIMIT_MEMLOCK and thus
> also require CAP_IPC_LOCK.
>
> Once you have CAP_IPC_LOCK, RLIMIT_MEMLOCK becomes irrelevant.

mlockall does not require CAP_IPC_LOCK. Never had an issue.

> > The semantics we agreed upon for mlock were that mlocked pages are
> > migratable. Pinned pages cannot be migrated and therefore are not
> > mlocked pages. It would be best to have two different mechanisms for
> > this.
>
> Best for whom? How is having two RLIMIT settings better than having
> one?

The pinned pages do not need to have a separate limit. They are kernel
resources and could be bounded in other ways together with other
resources that could cause danger to the kernel (such as too many dirty
pages). Pinned pages are often not associated with a process.

> > We do not need a new API.
>
> This is in direct contradiction with your earlier statement where you
> say: "I agree that we need mpin and munpin."
>
> Please make up your mind.

Yes, we need mpin and munpin as an in-kernel API. Glad to clue you in.

> > Having the ability in
> > general to move processes is a degree of control that we want. Pinning
> > could be restricted to where it is required by the hardware.
>
> Or software; you cannot blindly dismiss the realtime guarantees the
> software needs to provide.

I cringe when I hear "realtime guarantees" intended to refer to low
latency... But maybe they turned you finally? And I may just be feeling
my decade-old frustration with your old approach to "realtime". What
exactly do you mean by realtime guarantees? (Given that "realtime" is
such an overloaded marketing term.)

> > Excessive amounts of pinned pages severely limit the ability of the
> > kernel to manage its memory, which may cause instabilities.
> > mlocked pages
> > are better because the kernel still has some options.
>
> I'm not arguing which are better; I've even conceded we need mpin() and
> can keep the current mlock() semantics. I'm arguing for what we _need_.

We need the kernel to track the pinned pages (all of them, not just
those from IB and perf) and provide proper global boundaries to avoid
kernel failures. Yes.
Let's try to get this wrapped up?

On Thu, 6 Jun 2013 14:43:51 +0200 Peter Zijlstra <peterz@infradead.org> wrote:

> Patch bc3e53f682 ("mm: distinguish between mlocked and pinned pages")
> broke RLIMIT_MEMLOCK.

I rather like what bc3e53f682 did, actually. RLIMIT_MEMLOCK limits the
amount of memory you can mlock(). Nice and simple.

This pinning thing which infiniband/perf are doing is conceptually
different, and if we care at all, perhaps we should be looking at adding
RLIMIT_PINNED.

> Before that patch: mm_struct::locked_vm < RLIMIT_MEMLOCK; after that
> patch we have: mm_struct::locked_vm < RLIMIT_MEMLOCK &&
> mm_struct::pinned_vm < RLIMIT_MEMLOCK.

But this is a policy decision which was implemented in perf_mmap() and
perf can alter that decision. How bad would it be if perf just ignored
RLIMIT_MEMLOCK?

drivers/infiniband/hw/qib/qib_user_pages.c has issues, btw. It compares
the amount-to-be-pinned with rlimit(RLIMIT_MEMLOCK), but forgets to also
look at current->mm->pinned_vm. Duh.

It also does the pinned accounting in __qib_get_user_pages(), but for
__qib_release_user_pages() the caller is supposed to do it, which is
rather awkward.

Longer-term I don't think that infiniband or perf should be dinking
around with rlimit(RLIMIT_MEMLOCK) or ->pinned_vm. Those policy
decisions should be hoisted into a core mm helper where we can do it
uniformly (and more correctly than infiniband's attempt!).
On Thu, Jun 13, 2013 at 02:06:32PM -0700, Andrew Morton wrote:
> Let's try to get this wrapped up?
>
> On Thu, 6 Jun 2013 14:43:51 +0200 Peter Zijlstra <peterz@infradead.org> wrote:
>
> > Patch bc3e53f682 ("mm: distinguish between mlocked and pinned pages")
> > broke RLIMIT_MEMLOCK.
>
> I rather like what bc3e53f682 did, actually. RLIMIT_MEMLOCK limits the
> amount of memory you can mlock(). Nice and simple.
>
> This pinning thing which infiniband/perf are doing is conceptually
> different, and if we care at all, perhaps we should be looking at adding
> RLIMIT_PINNED.

We could do that; but I really don't like doing it for the reasons I
outlined previously. It gives the user another knob to twiddle which is
pretty much the same as one he already has, just slightly different.

Like I said, I see RLIMIT_MEMLOCK to mean the amount of pages the user
can exempt from paging, since that is what the VM cares about most.

> > Before that patch: mm_struct::locked_vm < RLIMIT_MEMLOCK; after that
> > patch we have: mm_struct::locked_vm < RLIMIT_MEMLOCK &&
> > mm_struct::pinned_vm < RLIMIT_MEMLOCK.
>
> But this is a policy decision which was implemented in perf_mmap() and
> perf can alter that decision. How bad would it be if perf just ignored
> RLIMIT_MEMLOCK?

Then it could pin all memory -- seems like something bad.

> drivers/infiniband/hw/qib/qib_user_pages.c has issues, btw. It
> compares the amount-to-be-pinned with rlimit(RLIMIT_MEMLOCK), but
> forgets to also look at current->mm->pinned_vm. Duh.
>
> It also does the pinned accounting in __qib_get_user_pages(), but for
> __qib_release_user_pages() the caller is supposed to do it, which is
> rather awkward.
>
> Longer-term I don't think that infiniband or perf should be dinking
> around with rlimit(RLIMIT_MEMLOCK) or ->pinned_vm. Those policy
> decisions should be hoisted into a core mm helper where we can do it
> uniformly (and more correctly than infiniband's attempt!).
Agreed, hence my VM_PINNED proposal that would lift most of that into
the core VM. I just got really lost in the IB code :/
On Fri, Jun 07, 2013 at 02:52:05PM +0000, Christoph Lameter wrote:
> On Fri, 7 Jun 2013, Peter Zijlstra wrote:
>
> > However you twist this, your patch leaves an inconsistent mess. If you
> > really think they're two different things then you should have
> > introduced a second RLIMIT_MEMPIN to go along with your counter.
>
> Well, continuing to repeat myself: I worked based on agreed upon
> characteristics of mlocked pages. The patch was there to address
> brokenness in the mlock accounting because someone naively assumed that
> pinning = mlock.

They did no such thing; I am one of those who wrote such code. I
expressly used RLIMIT_MEMLOCK because it is the one limit userspace has
to limit pages that are exempt from paging.

> > I'll argue against such a thing; for I think that limiting the total
> > amount of pages a user can exempt from paging is the far more
> > useful/natural thing to measure/limit.
>
> Pinned pages are exempted by the kernel. A device driver or some other
> kernel process (reclaim, page migration, IO etc.) increases the page
> count. There is currently no consistent accounting for pinned pages. The
> pinned_vm counter was introduced to allow the largest pinners to track
> what they did.

No, not the largest: user space controlled pinners. The thing that makes
all the difference is the _USER_ control.

> > > I said that the use of a PIN page flag would allow correct accounting if
> > > one wanted to interpret the limit the way you do.
> >
> > You failed to explain how that would help any. With a pin page flag you
> > still need to find the mm to unaccount crap from. Also, all user
> > controlled address space ops operate on vmas.
>
> Pinning is kernel controlled...

That's pure bull; perf and IB have user controlled pinning.

> > > Page migration is not a page fault?
> >
> > It introduces faults; what happens when a process hits the migration
> > pte? It gets a random delay and eventually services a minor fault to the
> > new page.
> Ok but this is similar to reclaim and other such things that are
> unmapping pages.

Which are avoided by mlock(); an mlock()ed page will (typically) not get
unmapped.

> > At which point the saw will have cut your finger off (going with the
> > most popular RT application ever -- that of a bandsaw and a laser beam).
>
> I am pretty confused by your newer notion of RT. RT was about high latency
> deterministic behavior I thought. RT was basically an abused marketing
> term and was referring to the bloating of the kernel with all sorts of
> fair stuff that slows us down. What happened to make you work on low
> latency stuff? There is some shift that you still need to go through to
> make that transition. Yes, you would want to avoid reclaim and all sorts
> of other stuff for low latency. So you disable auto NUMA, defrag etc to
> avoid these things.

It's about low latency deterministic stuff; it's just that worst case
latency is more important than avg latency and hence some things get
more expensive on avg. But we don't want to disable defrag, defrag is
good for the general health of the system. Also not all applications are
RT; we want a kernel that's able to run general purpose stuff along with
some RT apps.

> > > > This leaves the RT people unhappy -- therefore _if_ we continue with
> > > > this Linux specific interpretation of mlock() we must introduce new
> > > > syscalls that implement the intended mlock() semantics.
> > >
> > > Intended means Peter's semantics?
> >
> > No, I don't actually write RT applications. But I've had plenty of
> > arguments with RT people when I explained to them what our mlock()
> > actually does vs what they expected it to do.
>
> Ok Guess this is all new to you at this point. I am happy to see that you
> are willing to abandon your evil ways (although under pressure from your
> users) and are willing to put the low latency people now in the RT camp.

*sigh*.. we have shared goals up to a point.
RT bounds and if possible lowers the worst case latency. The worst case
is absolutely most important for us. By doing so we often also lower the
avg latency, but it's not the primary goal.

> > They're not happy. Aside from that; you HPC/HFT minimal latency lot
> > should very well appreciate the minimal interference stuff they do
> > actually expect.
>
> Sure we do and we know how to do things to work around the "fair
> scheduler" and other stuff. But you are breaking the basics of how we do
> things with your conflation of pinning and mlocking.

Baseless statement there; I've given a very clear and concise definition
of how I see mlock() and mpin() work together. You can't dismiss all
that and expect people to take you seriously.

And argue until your face is blue but IB and perf are very much user
controlled pins.

> We do not migrate, do not allow defragmentation or reclaim when running
> low latency applications. These are non issues.

Sounds like you lot are a bunch of work-around hacks. We very much want
to allow running such applications without having to rebuild the kernel.
In fact we want to allow running apps that need these CONFIG things both
enabled and disabled.

And while page migration and defrag have CONFIG knobs, reclaim does not;
file based reclaim is unconditional (except as already noted when using
mlock()).

I just want to be able to have all that enabled and still have
applications be able to express their needs and be able to run.

> > Here we must disagree I fear; given that mlock() is of RT origin and RT
> > people very much want/expect mlock() to do what our proposed mpin() will
> > do.
>
> RT is a dirty word for me given the fairness and bloat issue. Not sure
> what you mean with that. mlock is a means to keep data in memory and not a
> magical wand that avoids all OS handling of the page.

RT is about bounded worst case latency, and a preference for that bound
to be as low as possible.
That is all RT is and wants to be -- it's definitely not about fairness
in any way shape or form.

But yes, the intention was very much for the page (and mapping) to be
left alone by the OS -- it's unfortunate the spec's wording doesn't
clarify this.

> > > That cannot be so since mlocked pages need to be migratable.
> >
> > I'm talking about the proposed mpin() stuff.
>
> Could you write that up in detail? I am not sure how this could work at
> this point.

*omg*.. I did. mpin() (and we'll include IB and perf here) is a stronger
mlock() in that it must avoid all faults, not only the major faults.

After that there's just implementation details; the proposed
implementation used VM_PINNED to tag the VMAs the user called mpin() on.
It will use get_user_pages() to acquire a ref and effect the actual
pinning. Below I also proposed we could pre-compact such regions into
UNMOVABLE page blocks. But I don't much care about the implementation
per-se. All I've been really arguing for is semantics.

> > So I proposed most of the machinery that would be required to actually
> > implement the syscalls. Except that the IB code stumped me. In
> > particular I cannot easily find the userspace address to unpin for
> > ipath/qib release paths.
> >
> > Once we have that we can trivially implement the syscalls.
>
> Why would you need syscalls? Pinning is driver/kernel subsystem initiated
> and therefore the driver can do the pin/unpin calls.

mpin()/munpin() syscalls so userspace can effect the same on their
desired ranges.

> > > Pinning is not initiated by user space but by the kernel. Either
> > > temporarily (page count increases are used all over the kernel for this)
> > > or for longer time frame (IB and Perf and likely more drivers that we have
> > > not found yet).
> >
> > Here I disagree, I'll argue that all pins that require userspace action
> > to go away are userspace controlled pins.
> > This makes IB/Perf user controlled pins and should thus be treated the
> > same as mpin()/munpin() calls.
>
> Both IB and Perf are kernel subsystems that pin as a side effect of
> another syscall.

Yeah so? You still have user controlled pinning; the kernel cannot get
rid of these resources on its own.

> > This goes back to what you want the limit to mean; I think a single
> > limit counting all pages exempt from paging is the far more useful
> > limit.
>
> Then you first need to show that the scheme accounts for *all* pages
> exempt from paging. This includes dirty pages and various pages of other
> subsystems that have an increased refcount in order to keep these pages
> where they are. We need a couple of passes through the kernel to find
> these locations and build up counters for these. Not all of these pages
> will be associated with processes.

All _user_ controlled pages exempt from paging. This would be mlock() +
IB + perf. AFAIK there's nothing else that a user can do to exempt pages
from paging.

Our big disconnect seems to be user controlled vs kernel controlled.
Why do you care about kernel pins? Those are transient, are they not?

> > The only way to get an MCL_FUTURE is through CAP_IPC_LOCK. Practically
> > most MCL_CURRENT are also larger than RLIMIT_MEMLOCK and this also
> > requires CAP_IPC_LOCK.
> >
> > Once you have CAP_IPC_LOCK, RLIMIT_MEMLOCK becomes irrelevant.
>
> mlockall does not require CAP_IPC_LOCK. Never had an issue.

MCL_FUTURE does absolutely require CAP_IPC_LOCK, MCL_CURRENT requires a
huge (as opposed to the default 64k) RLIMIT or CAP_IPC_LOCK.

There's no argument there, look at the code.

> > > The semantics we agreed upon for mlock were that they are migratable.
> > > Pinned pages cannot be migrated and therefore are not mlocked pages.
> > > It would be best to have two different mechanisms for this.
> >
> > Best for whom? How is making two RLIMIT settings better than having
> > one?
> The pinned pages do not need to have a separate limit. They are kernel
> resources and could be bounded in other ways together with other
> resources that could cause danger to the kernel (such as too many dirty
> pages). Pinned pages are often not associated with a process.

So why is grouping user space pinned pages with mlock()ed pages such a
bad idea? Both can cause VM deadlocks by not allowing pages to be
reclaimed.

Note that throughout I've been talking about user space controlled pins;
I really don't see a problem with kernel pins, which are by and large
small and transient.

> > > We do not need a new API.
> >
> > This is in direct contradiction with your earlier statement where you
> > say: "I agree that we need mpin and munpin.".
> >
> > Please make up your mind.
>
> Yes we need mpin and munpin as an in-kernel API. Glad to clue you in.

Dude,.. my VM_PINNED patch had that; it's you who seems to be somewhat
slow of wit here.

> > > Having the ability in
> > > general to move processes is a degree of control that we want. Pinning
> > > could be restricted to where it is required by the hardware.
> >
> > Or software; you cannot blindly dismiss realtime guarantees the software
> > needs to provide.
>
> I cringe when I hear realtime guarantees and intending that to refer to
> low latency.... But maybe they turned you finally? And I may just be
> feeling my decade old frustration with your old approach to "realtime".
>
> What do you exactly mean by realtime guarantees? (Given that "realtime" is
> such an overloaded marketing term).

Bounded, and as low as possible, worst case latency. The bound gives
determinism; the second gives finer resolution.

Yes there's a whole host of various real world realtime software
constraints. There's the hard realtime case where if you miss a deadline
the world 'ends' -- tokamak EM control for nuclear fusion; the band saw
vs finger thing etc.
And there's soft realtime in a hundred different shades but it's
generally understood to mean stuff that can recover from sporadic
lateness. We want to cater to all those people and thus typically focus
on the hard-rt case.

Now a RT application will typically have a large !RT part; it will log
data, maybe update decision matrices, talk to other computers etc. So
while we want our self driving car to not collide with stuff, we also
want to map the road, download maps, track other moving objects etc. Not
all of that has the same constraints; sensor data must always precede
maps if at all available, mapping/updating maps isn't at all important
but nice to have etc.

Such systems would like as full a general purpose system as possible
that is still able to provide guarantees where required. Therefore it's
desired to have 'nonsense' like compaction etc. enabled.

> > > Excessive amounts of pinned pages limit the ability of the kernel
> > > to manage its memory severely which may cause instabilities. mlocked
> > > pages are better because the kernel still has some options.
> >
> > I'm not arguing which are better; I've even conceded we need mpin() and
> > can keep the current mlock() semantics. I'm arguing for what we _need_.
>
> We need the kernel to track the pinned pages (all of them not just those
> from IB and perf) and provide proper global boundaries to avoid kernel
> failures. Yes.

That's not what I said. Also you failed to provide rationale for your
desire to do so. Why would we need to track the transient kernel pins?
Those will go away on their own.

The IB/perf/mpin() pins are special in that they're dependent on
userspace and will not go away on their own. Also they need to be
limited lest userspace consume too much of them and VM deadlocks might
ensue.
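For concreteness, the mpin()/VM_PINNED machinery Peter describes above
might look roughly like the following pseudocode sketch (2013-era kernel
style). This is not the actual patch: VM_PINNED existed only in his
unmerged series, and apply_vma_flag() is a made-up helper standing in
for the usual mlock_fixup()-style vma walking:

```c
/* Pseudocode sketch only; VM_PINNED and apply_vma_flag() are assumptions. */
SYSCALL_DEFINE2(mpin, unsigned long, start, size_t, len)
{
	struct mm_struct *mm = current->mm;
	long ret;

	down_write(&mm->mmap_sem);

	/* ...account the range against RLIMIT_MEMLOCK, as mlock() does... */

	/* 1) tag the range so the VM knows faults must never happen here */
	ret = apply_vma_flag(mm, start, len, VM_PINNED);

	/* 2) fault everything in now and take a reference on each page,
	 *    which exempts it from reclaim, migration and compaction */
	if (!ret)
		ret = get_user_pages(current, mm, start, len >> PAGE_SHIFT,
				     1 /* write */, 0 /* force */,
				     NULL, NULL);

	up_write(&mm->mmap_sem);
	return ret < 0 ? ret : 0;
}
```

munpin() would drop the flag and the page references; the IB/perf paths
would then be converted to share this accounting instead of open-coding
pinned_vm updates.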
On Thu, 13 Jun 2013, Andrew Morton wrote:

> Let's try to get this wrapped up?
>
> On Thu, 6 Jun 2013 14:43:51 +0200 Peter Zijlstra <peterz@infradead.org> wrote:
>
> > Patch bc3e53f682 ("mm: distinguish between mlocked and pinned pages")
> > broke RLIMIT_MEMLOCK.
>
> I rather like what bc3e53f682 did, actually. RLIMIT_MEMLOCK limits the
> amount of memory you can mlock(). Nice and simple.
>
> This pinning thing which infiniband/perf are doing is conceptually
> different and if we care at all, perhaps we should be looking at adding
> RLIMIT_PINNED.

Actually PINNED is just a stronger version of MEMLOCK. PINNED and
MEMLOCK are both preventing the page from being paged out. PINNED adds
the constraint of preventing minor faults as well.

So I think the really important tuning knob is the limitation of pages
which cannot be paged out. And this is what RLIMIT_MEMLOCK is about.

Now if you want to add RLIMIT_PINNED as well, then it only limits the
number of pages which cannot create minor faults, but that does not
affect the limitation of total pages which cannot be paged out.

Thanks,

	tglx
On Mon, 17 Jun 2013, Peter Zijlstra wrote:

> They did no such thing; being one of those who wrote such code. I
> expressly used RLIMIT_MEMLOCK because it's the one limit userspace has
> to limit pages that are exempt from paging.

Don't remember reviewing that. Assumptions were wrong in that patch then.

> > Pinned pages are exempted by the kernel. A device driver or some other
> > kernel process (reclaim, page migration, io etc) increase the page count.
> > There is currently no consistent accounting for pinned pages. The
> > vm_pinned counter was introduced to allow the largest pinners to track
> > what they did.
>
> No, not the largest, user space controlled pinners. The thing that
> makes all the difference is the _USER_ control.

The pinning *cannot* be done from user space. Here it is the IB
subsystem that is doing it.

> > Pinning is kernel controlled...
>
> That's pure bull; perf and IB have user controlled pinning.

Pinning is a side effect of memory registration in IB.

> > Ok but this is similar to reclaim and other such things that are
> > unmapping pages.
>
> Which are avoided by mlock(); an mlock()ed page will (typically) not get
> unmapped.

Oh it will get unmapped definitely by page migration and such. Otherwise
it could not be moved.

> But we don't want to disable defrag, defrag is good for the general
> health of the system. Also not all applications are RT; we want a kernel
> that's able to run general purpose stuff along with some RT apps.

Still have no idea what you mean by RT.

> *sigh*.. we have shared goals up to a point. RT bounds and if possible
> lowers the worst case latency. The worst case is absolutely most
> important for us.

Ok that differs then.

> By doing so we often also lower the avg latency, but it's not the
> primary goal.

So far we have mostly seen these measures increase the average latency.

> And argue until your face is blue but IB and perf are very much user
> controlled pins.
Sure the user controls the memory registered with IB but the pinning is
done because a device maps the memory. Side effect.

> And while page migration and defrag have CONFIG knobs, reclaim does not;
> file based reclaim is unconditional (except as already noted when using
> mlock()).

Yes that is a big problem.

> I just want to be able to have all that enabled and still have
> applications be able to express their needs and be able to run.

That is not working because these goodies all cause additional latency,
by taking the processor away and disturbing the caches. You cannot have
the cake and eat it too.

> But yes, the intention was very much for the page (and mapping) to be
> left alone by the OS -- it's unfortunate the spec's wording doesn't
> clarify this.

All the talks about this with the mm developers that I have been in left
a different impression and we have implemented it according to the
understanding that mlocked pages are movable. It's not that I am that
fond of it but I accepted it and worked with that notion.

> > > I'm talking about the proposed mpin() stuff.
> >
> > Could you write that up in detail? I am not sure how this could work at
> > this point.
>
> *omg*.. I did. mpin() (and we'll include IB and perf here) is a stronger
> mlock() in that it must avoid all faults, not only the major faults.

Looked like hand waving to me. How exactly does mpin get implemented and
how does it interact with the vm? Do we have to create an additional LRU
for this? Does this do any good given that there is still reclaim and
lots of other stuff going on?

> But I don't much care about the implementation per-se. All I've been
> really arguing for is semantics.

Yes you are arguing to change the established semantics for mlocked
pages.

> > Both IB and Perf are kernel subsystems that pin as a side effect of
> > another syscall.
>
> Yeah so?
> You still have user controlled pinning; the kernel cannot get rid of
> these resources on its own.

Oh there is an mmu_notifier subsystem for this purpose. You can setup a
callback that can unpin pages. Again this is kernel specific. The device
driver needs to register the mmu notifier.

> All _user_ controlled pages exempt from paging. This would be mlock() +
> IB + perf. AFAIK there's nothing else that a user can do to exempt pages
> from paging.

What about dirtying pages, writeback?

> Our big disconnect seems to be user controlled vs kernel controlled.
>
> Why do you care about kernel pins? Those are transient, are they not?

Nope, the pins for page migration, IB etc can be quite long. And I
suspect that there are multiple subsystems that we are not aware of that
also pin for a longer time period but do not account for their pinning
actions.

> > mlockall does not require CAP_IPC_LOCK. Never had an issue.
>
> MCL_FUTURE does absolutely require CAP_IPC_LOCK, MCL_CURRENT requires a
> huge (as opposed to the default 64k) RLIMIT or CAP_IPC_LOCK.
>
> There's no argument there, look at the code.

I am sorry but we have been using mlockall() for years now without the
issues that you are bringing up. AFAICT mlockall does not require
MCL_FUTURE.

> > The pinned pages do not need to have a separate limit. They are kernel
> > resources and could be bounded in other ways together with other
> > resources that could cause danger to the kernel (such as too many
> > dirty pages). Pinned pages are often not associated with a process.
>
> So why is grouping user space pinned pages with mlock()ed pages such a
> bad idea? Both can cause VM deadlocks by not allowing pages to be
> reclaimed.

So can other things.

> > What do you exactly mean by realtime guarantees? (Given that "realtime"
> > is such an overloaded marketing term).
>
> Bounded, and as low as possible, worst case latency. The bound gives
> determinism; the second gives finer resolution.
Well and you still want to run reclaim on the processor etc? What is the
point of all of that when the OS can come in and take the processor away
for a much longer time frame.

> So while we want our self driving car to not collide with stuff, we also
> want to map the road, download maps, track other moving objects etc. Not
> all of that has the same constraints; sensor data must always precede
> maps if at all available, mapping/updating maps isn't at all important
> but nice to have etc.

Well if there is a big boulder on the road (reclaim) then I would worry
about that rather than put extensive effort into clearing the road of
small stones (like minor faults).

> That's not what I said. Also you failed to provide rationale for your
> desire to do so. Why would we need to track the transient kernel pins?
> Those will go away on their own.

I was not talking about transient pins but the longer term ones.

> The IB/perf/mpin() pins are special in that they're dependent on
> userspace and will not go away on their own. Also they need to be
> limited lest userspace consume too much of them and VM deadlocks might
> ensue.

Yes, that is what I said should be done.
* Thomas Gleixner <tglx@linutronix.de> wrote:

> On Thu, 13 Jun 2013, Andrew Morton wrote:
> > Let's try to get this wrapped up?
> >
> > On Thu, 6 Jun 2013 14:43:51 +0200 Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > > Patch bc3e53f682 ("mm: distinguish between mlocked and pinned pages")
> > > broke RLIMIT_MEMLOCK.
> >
> > I rather like what bc3e53f682 did, actually. RLIMIT_MEMLOCK limits the
> > amount of memory you can mlock(). Nice and simple.
> >
> > This pinning thing which infiniband/perf are doing is conceptually
> > different and if we care at all, perhaps we should be looking at adding
> > RLIMIT_PINNED.
>
> Actually PINNED is just a stronger version of MEMLOCK. PINNED and
> MEMLOCK are both preventing the page from being paged out. PINNED adds
> the constraint of preventing minor faults as well.
>
> So I think the really important tuning knob is the limitation of pages
> which cannot be paged out. And this is what RLIMIT_MEMLOCK is about.
>
> Now if you want to add RLIMIT_PINNED as well, then it only limits the
> number of pages which cannot create minor faults, but that does not
> affect the limitation of total pages which cannot be paged out.

Agreed.

( Furthermore, the RLIMIT_MEMLOCK semantics change actively broke code
  so this is not academic and it would be nice to progress with it. )

Thanks,

	Ingo
* Christoph Lameter <cl@gentwo.org> wrote:

> On Mon, 17 Jun 2013, Peter Zijlstra wrote:
>
> > They did no such thing; being one of those who wrote such code. I
> > expressly used RLIMIT_MEMLOCK because it's the one limit userspace has
> > to limit pages that are exempt from paging.
>
> Don't remember reviewing that. Assumptions were wrong in that patch then.
>
> > > Pinned pages are exempted by the kernel. A device driver or some other
> > > kernel process (reclaim, page migration, io etc) increase the page count.
> > > There is currently no consistent accounting for pinned pages. The
> > > vm_pinned counter was introduced to allow the largest pinners to track
> > > what they did.
> >
> > No, not the largest, user space controlled pinners. The thing that
> > makes all the difference is the _USER_ control.
>
> The pinning *cannot* be done from user space. Here it is the IB subsystem
> that is doing it.

Peter clearly pointed it out that in the perf case it's user-space that
initiates the pinned memory mapping which is resource-controlled via
RLIMIT_MEMLOCK - and this was implemented that way before your commit
broke the code.

You seem to be hell bent on defining 'memory pinning' only as "the thing
done via the mlock*() system calls", but that is a nonsensical
distinction that actively and incorrectly ignores other system calls
that can and do pin memory legitimately.

If some other system call results in mapping pinned memory that is at
least as restrictively pinned as an mlock()-ed vma (the perf syscall is
such) then it's entirely proper design to be resource controlled under
RLIMIT_MEMLOCK as well. In fact this worked so before your commit broke
it.

> > > mlockall does not require CAP_IPC_LOCK. Never had an issue.
> >
> > MCL_FUTURE does absolutely require CAP_IPC_LOCK, MCL_CURRENT requires
> > a huge (as opposed to the default 64k) RLIMIT or CAP_IPC_LOCK.
> >
> > There's no argument there, look at the code.
> I am sorry but we have been mlockall() for years now without the issues
> that you are bringing up. AFAICT mlockall does not require MCL_FUTURE.

You only have to read the mlockall() code to see that Peter's claim is
correct:

mm/mlock.c:

	SYSCALL_DEFINE1(mlockall, int, flags)
	{
		unsigned long lock_limit;
		int ret = -EINVAL;

		if (!flags || (flags & ~(MCL_CURRENT | MCL_FUTURE)))
			goto out;

		ret = -EPERM;
		if (!can_do_mlock())
			goto out;
	...

	int can_do_mlock(void)
	{
		if (capable(CAP_IPC_LOCK))
			return 1;
		if (rlimit(RLIMIT_MEMLOCK) != 0)
			return 1;
		return 0;
	}
	EXPORT_SYMBOL(can_do_mlock);

Q.E.D.

Thanks,

	Ingo
On Thu, 20 Jun 2013, Ingo Molnar wrote:

> Peter clearly pointed it out that in the perf case it's user-space that
> initiates the pinned memory mapping which is resource-controlled via
> RLIMIT_MEMLOCK - and this was implemented that way before your commit
> broke the code.

There is no way that user space can initiate a page pin right now. Perf
is pinning the page from the kernel. Similarly the IB subsystem pins
memory needed for device I/O.

> You seem to be hell bent on defining 'memory pinning' only as "the thing
> done via the mlock*() system calls", but that is a nonsensical
> distinction that actively and incorrectly ignores other system calls
> that can and do pin memory legitimately.

Nope. I have said that memory pinning is done by increasing the
refcount, which is different from mlock, which sets a page flag. I have
consistently argued that these are two different things. And I am a bit
surprised that this point has not been understood after all these
repetitions.

Memory pinning these days is done as a side effect of kernel / driver
needs. I.e. the memory registration done through the IB subsystem and
elsewhere.

> 	int can_do_mlock(void)
> 	{
> 		if (capable(CAP_IPC_LOCK))
> 			return 1;
> 		if (rlimit(RLIMIT_MEMLOCK) != 0)
> 			return 1;
> 		return 0;
> 	}
> 	EXPORT_SYMBOL(can_do_mlock);
>
> Q.E.D.

Argh. Just checked the apps. True. They did set the rlimit to 0 at some
point in order to make this work. Then they monitor the number of locked
pages and create alerts so that action can be taken if a system uses too
many mlocked pages.
On Thu, Jun 20, 2013 at 7:48 AM, Christoph Lameter <cl@linux.com> wrote:

> There is no way that user space can initiate a page pin right now. Perf
> is pinning the page from the kernel. Similarly the IB subsystem pins
> memory needed for device I/O.

Christoph, your argument would be a lot more convincing if you stopped
repeating this nonsense. Sure, in a strict sense, it might be true that
the IB subsystem in the kernel is the code that actually pins memory,
but given that unprivileged userspace can tell the kernel to pin
arbitrary parts of its memory for any amount of time, is that relevant?

And in fact taking your "initiate" word choice above, I don't even think
your statement is true -- userspace initiates the pinning by, for
example, doing an IB memory registration (libibverbs ibv_reg_mr() call),
which turns into a system call, which leads to the kernel trying to pin
pages. The pages aren't unpinned until userspace unregisters the memory
(or causes a cleanup by closing the context fd).

Here's an argument by analogy. Would it make any sense for me to say
userspace can't mlock memory, because only the kernel can set VM_LOCKED
on a vma? Of course not. Userspace has the mlock() system call, and
although the actual work happens in the kernel, we clearly want to be
able to limit the amount of memory locked by the kernel ON BEHALF OF
USERSPACE.

 - R.
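The userspace sequence Roland describes looks roughly like this -- a
hedged sketch, not runnable without libibverbs and an RDMA device, with
all error handling elided:

```c
/* Sketch: unprivileged userspace asks the kernel to pin 1 MB via IB
 * memory registration; the pin lasts until the app deregisters. */
#include <infiniband/verbs.h>
#include <stdlib.h>

int main(void)
{
	int n;
	struct ibv_device **devs = ibv_get_device_list(&n);
	struct ibv_context *ctx = ibv_open_device(devs[0]);
	struct ibv_pd *pd = ibv_alloc_pd(ctx);

	size_t len = 1 << 20;
	void *buf = malloc(len);

	/* This syscall path ends in get_user_pages(): the kernel pins
	 * 'len' bytes of this process's memory on its behalf. */
	struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
				       IBV_ACCESS_LOCAL_WRITE |
				       IBV_ACCESS_REMOTE_READ);

	/* ... RDMA traffic against mr->lkey / mr->rkey ... */

	ibv_dereg_mr(mr);	/* pin released only at user request */
	ibv_dealloc_pd(pd);
	ibv_close_device(ctx);
	ibv_free_device_list(devs);
	return 0;
}
```

This is the accounting question in miniature: the pin is performed by
the kernel, but its lifetime is entirely under userspace control.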
On Thu, 20 Jun 2013, Roland Dreier wrote:

> Christoph, your argument would be a lot more convincing if you stopped
> repeating this nonsense. Sure, in a strict sense, it might be true

Well this is regarding tracking of pages that need to stay resident and
since the kernel does the pinning through the IB subsystem it is
trackable right there. No nonsense and no need for a separate pinning
system call.

> that the IB subsystem in the kernel is the code that actually pins
> memory, but given that unprivileged userspace can tell the kernel to
> pin arbitrary parts of its memory for any amount of time, is that
> relevant? And in fact taking your "initiate" word choice above, I
> don't even think your statement is true -- userspace initiates the
> pinning by, for example, doing an IB memory registration (libibverbs
> ibv_reg_mr() call), which turns into a system call, which leads to the
> kernel trying to pin pages. The pages aren't unpinned until userspace
> unregisters the memory (or causes a cleanup by closing the context
> fd).

In some sense userspace initiates everything since the kernel's purpose
is to run applications. So you could say that everything is user
initiated if you wanted.

However, the user visible mechanism here is a registration of memory
with the IB subsystem for RDMA. The primary intent is not to pin the
pages but to make memory available for remote I/O. The pages are pinned
*because* otherwise remote RDMA operations could corrupt memory due to
the kernel moving/evicting memory.

> Here's an argument by analogy. Would it make any sense for me to say
> userspace can't mlock memory, because only the kernel can set
> VM_LOCKED on a vma? Of course not. Userspace has the mlock() system
> call, and although the actual work happens in the kernel, we clearly
> want to be able to limit the amount of memory locked by the kernel ON
> BEHALF OF USERSPACE.
I would think that mlock is a memory management function and therefore
the app/user directly says that the memory is not to be evicted from
memory. This is different for the IB subsystem, which is dealing with
I/O and only indirectly with memory. Had we a different mechanism to
prevent reclaim etc. we would not need to pin the pages.

Actually there is such a mechanism that could be used here. If you had a
reserved memory region that is not mapped by the kernel (boot time
alloc, device memory) then you could use VM_PFNMAP to refer to that
region and the kernel would not be able to do reclaim on that memory. No
pinning necessary if the IB subsystem would register that type of
memory.
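A rough kernel-side sketch of that VM_PFNMAP suggestion, using 2013-era
APIs; `mydev_phys` and `mydev_size` are hypothetical names for a region
reserved at boot (e.g. via memblock_reserve()), and this is illustrative
only, not how the IB core actually works:

```c
/* Sketch: expose a reserved region to userspace via VM_PFNMAP. Such
 * pages were never given to the page allocator, are not on any LRU,
 * and so are invisible to reclaim, migration and compaction -- no
 * per-page pin (refcount bump) is needed to keep them resident. */
static int mydev_mmap(struct file *file, struct vm_area_struct *vma)
{
	unsigned long size = vma->vm_end - vma->vm_start;

	if (size > mydev_size)
		return -EINVAL;

	/* remap_pfn_range() marks the vma VM_PFNMAP | VM_IO. */
	return remap_pfn_range(vma, vma->vm_start,
			       mydev_phys >> PAGE_SHIFT,
			       size, vma->vm_page_prot);
}
```

The trade-off is that such memory is carved out of the system up front,
rather than being ordinary pageable memory temporarily exempted.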
diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
index a841123..cc92137b 100644
--- a/drivers/infiniband/core/umem.c
+++ b/drivers/infiniband/core/umem.c
@@ -137,7 +137,7 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,
 	down_write(&current->mm->mmap_sem);
 
-	locked     = npages + current->mm->pinned_vm;
+	locked     = npages + current->mm->locked_vm;
 	lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
 
 	if ((locked > lock_limit) && !capable(CAP_IPC_LOCK)) {
@@ -207,7 +207,7 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,
 		__ib_umem_release(context->device, umem, 0);
 		kfree(umem);
 	} else
-		current->mm->pinned_vm = locked;
+		current->mm->locked_vm = locked;
 
 	up_write(&current->mm->mmap_sem);
 	if (vma_list)
@@ -223,7 +223,7 @@ static void ib_umem_account(struct work_struct *work)
 	struct ib_umem *umem = container_of(work, struct ib_umem, work);
 
 	down_write(&umem->mm->mmap_sem);
-	umem->mm->pinned_vm -= umem->diff;
+	umem->mm->locked_vm -= umem->diff;
 	up_write(&umem->mm->mmap_sem);
 	mmput(umem->mm);
 	kfree(umem);
@@ -269,7 +269,7 @@ void ib_umem_release(struct ib_umem *umem)
 	} else
 		down_write(&mm->mmap_sem);
 
-	current->mm->pinned_vm -= diff;
+	current->mm->locked_vm -= diff;
 	up_write(&mm->mmap_sem);
 	mmput(mm);
 	kfree(umem);
diff --git a/drivers/infiniband/hw/ipath/ipath_user_pages.c b/drivers/infiniband/hw/ipath/ipath_user_pages.c
index dc66c45..cfed539 100644
--- a/drivers/infiniband/hw/ipath/ipath_user_pages.c
+++ b/drivers/infiniband/hw/ipath/ipath_user_pages.c
@@ -79,7 +79,7 @@ static int __ipath_get_user_pages(unsigned long start_page, size_t num_pages,
 		goto bail_release;
 	}
 
-	current->mm->pinned_vm += num_pages;
+	current->mm->locked_vm += num_pages;
 
 	ret = 0;
 	goto bail;
@@ -178,7 +178,7 @@ void ipath_release_user_pages(struct page **p, size_t num_pages)
 
 	__ipath_release_user_pages(p, num_pages, 1);
 
-	current->mm->pinned_vm -= num_pages;
+	current->mm->locked_vm -= num_pages;
 
 	up_write(&current->mm->mmap_sem);
 }
@@ -195,7 +195,7 @@ static void user_pages_account(struct work_struct *_work)
 		container_of(_work, struct ipath_user_pages_work, work);
 
 	down_write(&work->mm->mmap_sem);
-	work->mm->pinned_vm -= work->num_pages;
+	work->mm->locked_vm -= work->num_pages;
 	up_write(&work->mm->mmap_sem);
 	mmput(work->mm);
 	kfree(work);
diff --git a/drivers/infiniband/hw/qib/qib_user_pages.c b/drivers/infiniband/hw/qib/qib_user_pages.c
index 2bc1d2b..7689e49 100644
--- a/drivers/infiniband/hw/qib/qib_user_pages.c
+++ b/drivers/infiniband/hw/qib/qib_user_pages.c
@@ -74,7 +74,7 @@ static int __qib_get_user_pages(unsigned long start_page, size_t num_pages,
 		goto bail_release;
 	}
 
-	current->mm->pinned_vm += num_pages;
+	current->mm->locked_vm += num_pages;
 
 	ret = 0;
 	goto bail;
@@ -151,7 +151,7 @@ void qib_release_user_pages(struct page **p, size_t num_pages)
 	__qib_release_user_pages(p, num_pages, 1);
 
 	if (current->mm) {
-		current->mm->pinned_vm -= num_pages;
+		current->mm->locked_vm -= num_pages;
 		up_write(&current->mm->mmap_sem);
 	}
 }
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 3e636d8..7d09d6a 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -44,7 +44,6 @@ void task_mem(struct seq_file *m, struct mm_struct *mm)
 		"VmPeak:\t%8lu kB\n"
 		"VmSize:\t%8lu kB\n"
 		"VmLck:\t%8lu kB\n"
-		"VmPin:\t%8lu kB\n"
 		"VmHWM:\t%8lu kB\n"
 		"VmRSS:\t%8lu kB\n"
 		"VmData:\t%8lu kB\n"
@@ -56,7 +55,6 @@ void task_mem(struct seq_file *m, struct mm_struct *mm)
 		hiwater_vm << (PAGE_SHIFT-10),
 		total_vm << (PAGE_SHIFT-10),
 		mm->locked_vm << (PAGE_SHIFT-10),
-		mm->pinned_vm << (PAGE_SHIFT-10),
 		hiwater_rss << (PAGE_SHIFT-10),
 		total_rss << (PAGE_SHIFT-10),
 		data << (PAGE_SHIFT-10),
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index ace9a5f..d185c13 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -356,7 +356,6 @@ struct mm_struct {
 	unsigned long total_vm;		/* Total pages mapped */
 	unsigned long locked_vm;	/* Pages that have PG_mlocked set */
-	unsigned long pinned_vm;	/* Refcount permanently increased */
 	unsigned long shared_vm;	/* Shared pages (files) */
 	unsigned long exec_vm;		/* VM_EXEC & ~VM_WRITE */
 	unsigned long stack_vm;		/* VM_GROWSUP/DOWN */
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 95edd5a..1e926b2 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -3729,7 +3729,7 @@ static void perf_mmap_close(struct vm_area_struct *vma)
 	if (ring_buffer_put(rb)) {
 		atomic_long_sub((size >> PAGE_SHIFT) + 1, &mmap_user->locked_vm);
-		vma->vm_mm->pinned_vm -= mmap_locked;
+		vma->vm_mm->locked_vm -= mmap_locked;
 		free_uid(mmap_user);
 	}
 }
@@ -3805,7 +3805,7 @@ static int perf_mmap(struct file *file, struct vm_area_struct *vma)
 	lock_limit = rlimit(RLIMIT_MEMLOCK);
 	lock_limit >>= PAGE_SHIFT;
 
-	locked = vma->vm_mm->pinned_vm + extra;
+	locked = vma->vm_mm->locked_vm + extra;
 
 	if ((locked > lock_limit) && perf_paranoid_tracepoint_raw() &&
 	    !capable(CAP_IPC_LOCK)) {
@@ -3831,7 +3831,7 @@ static int perf_mmap(struct file *file, struct vm_area_struct *vma)
 	rb->mmap_user = get_current_user();
 
 	atomic_long_add(user_extra, &user->locked_vm);
-	vma->vm_mm->pinned_vm += extra;
+	vma->vm_mm->locked_vm += extra;
 
 	rcu_assign_pointer(event->rb, rb);
Patch bc3e53f682 ("mm: distinguish between mlocked and pinned pages") broke RLIMIT_MEMLOCK. Before that patch the constraint was:

  mm_struct::locked_vm < RLIMIT_MEMLOCK;

after it we have:

  mm_struct::locked_vm < RLIMIT_MEMLOCK &&
  mm_struct::pinned_vm < RLIMIT_MEMLOCK.

The patch doesn't mention RLIMIT_MEMLOCK and thus also doesn't discuss this (user-visible) change in semantics, so we must assume it was unintentional.

Since RLIMIT_MEMLOCK is very clearly a limit on the amount of pages the process can 'lock' into memory, it should very much include pinned pages as well as mlock()ed pages; neither can be paged out.

Since nobody had anything constructive to say about the VM_PINNED approach and the IB code hurts my head too much to make it work, I propose we revert said patch.

Once again the rationale: MLOCK(2) is part of the POSIX Realtime Extension (1003.1b-1993/1003.1i-1995). It states that the specified part of the user address space must stay memory resident until either program exit or a matching munlock() call.

This definition basically excludes major faults from happening on those pages -- a major fault being one where IO must happen to obtain the page content; the direct implication being that the page content must remain in memory.

Linux has taken this literally and made mlock()ed pages subject to page migration (albeit only for the explicit move_pages() syscall; but it would very much like to make them subject to implicit page migration for the purpose of compaction etc.).

This view disregards the intention of the spec; since mlock() is part of the realtime spec, the intention is very much that the user address range generate no faults -- neither minor nor major -- as any delay is unacceptable.

This leaves the RT people unhappy -- therefore, _if_ we continue with this Linux-specific interpretation of mlock(), we must introduce new syscalls that implement the intended mlock() semantics.
It was found that there are useful purposes for this weaker mlock(), a rationale to indeed have two sets of syscalls. The weaker mlock() can be used in the context of security -- where we avoid sensitive data being written to disk -- and in the context of userspace daemons that are part of the IO path, which would otherwise form IO deadlocks.

The proposed second set of primitives would be mpin() and munpin(), and would implement the intended mlock() semantics. Such pages would not be migratable in any way (a possible implementation being to 'pin' the pages using an extra refcount on the page frame). From the above we can see that any mpin()ed page is also an mlock()ed page, since mpin() disallows any fault and thus also disallows major faults.

While we still lack the formal mpin() and munpin() syscalls, there are a number of sites with similar 'side effects' that result in user-controlled 'pinning' of pages -- namely IB and perf.

For the purpose of RLIMIT_MEMLOCK we must use intent only, as it is not part of the formal spec. The only useful thing is to limit the amount of pages a user can exempt from paging; this therefore includes all pages either mlock()ed or mpin()ed.

Back to the patch: a resource limit must have a resource counter to enact the limit upon. Before the patch this was mm_struct::locked_vm; after the patch there is no such thing left.

The patch was proposed to 'fix' a double-accounting problem where pages are both pinned and mlock()ed. This was particularly visible when using mlockall() on a process that uses either IB or perf. I state that since mlockall() disables/invalidates RLIMIT_MEMLOCK, the actual resource counter value is irrelevant, and thus the reported problem is a non-problem.

However, it would still be possible to observe weirdness in the very unlikely event that a user would indeed call mlock() upon an address range obtained from IB/perf.
In this case he would be unduly constrained and find his effective RLIMIT_MEMLOCK halved (at worst). After the patch, that same user finds he has an effectively doubled RLIMIT_MEMLOCK, since the IB/perf pages are not counted towards the same limit as his mlock()ed pages.

It is far more likely that a user will employ mlock() on ranges different from those he received from IB/perf, since he already knows those aren't going anywhere. Therefore the patch trades an unlikely weirdness for a much more likely weirdness.

So, barring a proper solution, I propose we revert. I've yet to hear a coherent objection to the above. Christoph is always quick to yell 'but it fixes a double-accounting issue', but is completely deaf to the fact that he changed user-visible semantics without mention or regard.

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
---
 drivers/infiniband/core/umem.c                 | 8 ++++----
 drivers/infiniband/hw/ipath/ipath_user_pages.c | 6 +++---
 drivers/infiniband/hw/qib/qib_user_pages.c     | 4 ++--
 fs/proc/task_mmu.c                             | 2 --
 include/linux/mm_types.h                       | 1 -
 kernel/events/core.c                           | 6 +++---
 6 files changed, 12 insertions(+), 15 deletions(-)
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html