mbox series

[0/3] mm, pmem: Force unmap pmem on surprise remove

Message ID 161604048257.1463742.1374527716381197629.stgit@dwillia2-desk3.amr.corp.intel.com (mailing list archive)
Headers show
Series mm, pmem: Force unmap pmem on surprise remove | expand

Message

Dan Williams March 18, 2021, 4:08 a.m. UTC
Summary:

A dax_dev can be unbound from its driver at any time. Unbind can not
fail. The driver-core will always trigger ->remove() and the result from
->remove() is ignored. After ->remove() the driver-core proceeds to tear
down context. The filesystem-dax implementation can leave pfns mapped
after ->remove() if it is triggered while the filesystem is mounted.
Security and data-integrity is forfeit if the dax_dev is repurposed for
another security domain (new filesystem or change device modes), or if
the dax_dev is physically replaced. CXL is a hotplug bus that makes
dax_dev physical replace a real world prospect. 

All dax_dev pfns must be unmapped at remove. Detect the "remove while
mounted" case and trigger memory_failure() over the entire dax_dev
range.

Details:

The get_user_pages_fast() path expects all synchronization to be handled
by the pattern of checking for pte presence, taking a page reference,
and then validating that the pte was stable over that event. The
gup-fast path for devmap / DAX pages additionally attempts to take/hold
a live reference against the hosting pgmap over the page pin. The
rational for the pgmap reference is to synchronize against a dax-device
unbind / ->remove() event, but that is unnecessary if pte invalidation
is guaranteed in the ->remove() path.

Global dax-device pte invalidation *does* happen when the device is in
raw "device-dax" mode where there is a single shared inode to unmap at
remove, but the filesystem-dax path has a large number of actively
mapped inodes unknown to the driver at ->remove() time. So, that unmap
does not happen today for filesystem-dax. However, as Jason points out,
that unmap / invalidation *needs* to happen not only to cleanup
get_user_pages_fast() semantics, but in a future (see CXL) where dax_dev
->remove() is correlated with actual physical removal / replacement the
implications of allowing a physical pfn to be exchanged without tearing
down old mappings are severe (security and data-integrity).

What is not in this patch set is coordination with the dax_kmem driver
to trigger memory_failure() when the dax_dev is onlined as "System
RAM". The remove_memory() API was built with the assumption that
platform firmware negotiates all removal requests and the OS has a
chance to say "no". This is why dax_kmem today simply leaks
request_region() to burn that physical address space for any other
usage until the next reboot on a manual unbind event if the memory can't
be offlined. However a future to make sure that remove_memory() succeeds
after memory_failure() of the same range seems a better semantic than
permanently burning physical address space.

The topic of remove_memory() failures gets to the question of what
happens to active page references when the inopportune ->remove() event
happens. For transient pins the ->remove() event will wait for for all
pins to be dropped before allowing ->remove() to complete. Since
fileystem-dax forbids longterm pins all those pins are transient.
Device-dax, on the other hand, does allow longterm pins which means that
->remove() will hang unless / until the longterm pin is dropped.
Hopefully an unmap_mapping_range() event is sufficient to get the pin
dropped, but I suspect device-dax might need to trigger memory_failure()
as well to get the longterm pin holder to wake up and get out of the
way (TBD).

Lest we repeat the "longterm-pin-revoke" debate, which highlighted that
RDMA devices do not respond well to having context torn down, keep in
mind that this proposal is to do a best effort recovery of an event that
should not happen (surprise removal) under nominal operation.

---

Dan Williams (3):
      mm/memory-failure: Prepare for mass memory_failure()
      mm, dax, pmem: Introduce dev_pagemap_failure()
      mm/devmap: Remove pgmap accounting in the get_user_pages_fast() path


 drivers/dax/super.c      |   15 +++++++++++++++
 drivers/nvdimm/pmem.c    |   10 +++++++++-
 drivers/nvdimm/pmem.h    |    1 +
 include/linux/dax.h      |    5 +++++
 include/linux/memremap.h |    5 +++++
 include/linux/mm.h       |    3 +++
 mm/gup.c                 |   38 ++++++++++++++++----------------------
 mm/memory-failure.c      |   36 +++++++++++++++++++++++-------------
 mm/memremap.c            |   11 +++++++++++
 9 files changed, 88 insertions(+), 36 deletions(-)

Comments

David Hildenbrand March 25, 2021, 1:02 p.m. UTC | #1
On 18.03.21 05:08, Dan Williams wrote:
> Summary:
> 
> A dax_dev can be unbound from its driver at any time. Unbind can not
> fail. The driver-core will always trigger ->remove() and the result from
> ->remove() is ignored. After ->remove() the driver-core proceeds to tear
> down context. The filesystem-dax implementation can leave pfns mapped
> after ->remove() if it is triggered while the filesystem is mounted.
> Security and data-integrity is forfeit if the dax_dev is repurposed for
> another security domain (new filesystem or change device modes), or if
> the dax_dev is physically replaced. CXL is a hotplug bus that makes
> dax_dev physical replace a real world prospect.
> 
> All dax_dev pfns must be unmapped at remove. Detect the "remove while
> mounted" case and trigger memory_failure() over the entire dax_dev
> range.
> 
> Details:
> 
> The get_user_pages_fast() path expects all synchronization to be handled
> by the pattern of checking for pte presence, taking a page reference,
> and then validating that the pte was stable over that event. The
> gup-fast path for devmap / DAX pages additionally attempts to take/hold
> a live reference against the hosting pgmap over the page pin. The
> rational for the pgmap reference is to synchronize against a dax-device
> unbind / ->remove() event, but that is unnecessary if pte invalidation
> is guaranteed in the ->remove() path.
> 
> Global dax-device pte invalidation *does* happen when the device is in
> raw "device-dax" mode where there is a single shared inode to unmap at
> remove, but the filesystem-dax path has a large number of actively
> mapped inodes unknown to the driver at ->remove() time. So, that unmap
> does not happen today for filesystem-dax. However, as Jason points out,
> that unmap / invalidation *needs* to happen not only to cleanup
> get_user_pages_fast() semantics, but in a future (see CXL) where dax_dev
> ->remove() is correlated with actual physical removal / replacement the
> implications of allowing a physical pfn to be exchanged without tearing
> down old mappings are severe (security and data-integrity).
> 
> What is not in this patch set is coordination with the dax_kmem driver
> to trigger memory_failure() when the dax_dev is onlined as "System
> RAM". The remove_memory() API was built with the assumption that
> platform firmware negotiates all removal requests and the OS has a
> chance to say "no". This is why dax_kmem today simply leaks
> request_region() to burn that physical address space for any other
> usage until the next reboot on a manual unbind event if the memory can't
> be offlined. However a future to make sure that remove_memory() succeeds
> after memory_failure() of the same range seems a better semantic than
> permanently burning physical address space.

I'd have similar requirements for virtio-mem on forced driver unloading 
(although less of an issue, because in contrast to cxl, it doesn't 
really happen in sane environments). However, I'm afraid there are 
absolutely no guarantees that you can actually offline+rip out memory 
exposed to the buddy, at least in the general case.

I guess it might be possible to some degree if memory was onlined to 
ZONE_MOVABLE (there are still no guarantees, though), however, bets are 
completely off with ZONE_NORMAL. Just imagine the memmap of one dax 
device being allocated from another dax device. You cannot possibly 
invalidate via memory_failure() and rip out the memory without crashing 
the whole system afterwards.