[GIT,PULL] device-dax for 5.1: PMEM as RAM

Message ID CAPcyv4he0q_FdqqiXarp0bXjcggs8QZX8Od560E2iFxzCU3Qag@mail.gmail.com

Dan Williams March 10, 2019, 7:54 p.m. UTC
Hi Linus, please pull from:

  git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm tags/devdax-for-5.1

...to receive new device-dax infrastructure to allow persistent memory
and other "reserved" / performance-differentiated memories to be
assigned to the core-mm as "System RAM".

This has soaked in -next with only a simple conflict reported, and
Michal looked at it and said "overall design of this feature makes a
lot of sense to me" [1], but it is lacking non-Intel review/ack tags.
For that reason, here's some more commentary on the motivation and
implications:

[1]: https://lore.kernel.org/lkml/20190123170518.GC4087@dhcp22.suse.cz/

Some users want to use persistent memory as additional volatile
memory. They are willing to cope with potential performance
differences, for example between DRAM and 3D XPoint, and want to use
typical Linux memory-management APIs rather than a userspace memory
allocator layered over an mmap() of a dax file. The administration
model is to decide how much persistent memory (pmem) to use as System
RAM, create a device-dax-mode namespace of that size, and then assign
it to the core-mm. The rationale for device-dax is that it is a
generic memory-mapping driver that can be layered over any "special
purpose" memory, not just pmem. On subsequent boots udev rules can be
used to restore the memory assignment.
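For illustration, the administration flow above might look like the
following sketch. The device names (namespace0.0, dax0.0), the size, and
the exact sysfs driver paths are placeholder assumptions based on the new
/sys/bus/dax ABI in this series; the commands are wrapped in a function so
the snippet is illustrative only, not executed:

```shell
# Hypothetical provisioning flow for "PMEM as RAM". Device names, the
# size, and sysfs paths are assumptions, not verified against a system.
provision_pmem_as_ram() {
    # 1. Create a device-dax mode namespace of the desired size
    ndctl create-namespace --mode=devdax --size=16G

    # 2. Detach the resulting dax device from the default device_dax driver
    echo dax0.0 > /sys/bus/dax/drivers/device_dax/unbind

    # 3. Hand it to the kmem driver, which hotplugs the range into the
    #    core-mm as "System RAM" (auto-binds after new_id per this series)
    echo dax0.0 > /sys/bus/dax/drivers/kmem/new_id
}
```

A udev rule could run the last two steps on subsequent boots to restore
the assignment.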

One implication of using pmem as RAM is that mlock() no longer keeps
data off persistent media. For this reason it is recommended to enable
NVDIMM Security (previously merged for 5.0) to encrypt pmem contents
at rest. We considered making this recommendation an actively enforced
requirement, but in the end decided to leave it as a distribution /
administrator policy to allow for emulation and test environments that
lack security capable NVDIMMs.

Here is the resolution for the aforementioned conflict:

diff --cc mm/memory_hotplug.c
index a9d5787044e1,b37f3a5c4833..c4f59ac21014
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@@ -102,28 -99,21 +102,24 @@@ u64 max_mem_size = U64_MAX
  /* add this memory to iomem resource */
  static struct resource *register_memory_resource(u64 start, u64 size)
  {
-       struct resource *res, *conflict;
+       struct resource *res;
+       unsigned long flags =  IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY;
+       char *resource_name = "System RAM";

 +      if (start + size > max_mem_size)
 +              return ERR_PTR(-E2BIG);
 +
-       res = kzalloc(sizeof(struct resource), GFP_KERNEL);
-       if (!res)
-               return ERR_PTR(-ENOMEM);
-
-       res->name = "System RAM";
-       res->start = start;
-       res->end = start + size - 1;
-       res->flags = IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY;
-       conflict =  request_resource_conflict(&iomem_resource, res);
-       if (conflict) {
-               if (conflict->desc == IORES_DESC_DEVICE_PRIVATE_MEMORY) {
-                       pr_debug("Device unaddressable memory block "
-                                "memory hotplug at %#010llx !\n",
-                                (unsigned long long)start);
-               }
-               pr_debug("System RAM resource %pR cannot be added\n", res);
-               kfree(res);
+       /*
+        * Request ownership of the new memory range.  This might be
+        * a child of an existing resource that was present but
+        * not marked as busy.
+        */
+       res = __request_region(&iomem_resource, start, size,
+                              resource_name, flags);
+
+       if (!res) {
+               pr_debug("Unable to reserve System RAM region: %016llx->%016llx\n",
+                               start, start + size);
                return ERR_PTR(-EEXIST);
        }
        return res;


* Note, I'm sending this with Gmail rather than Evolution (which goes
through my local Exchange server) as the latter mangles the message
into something the pr-tracker-bot decides to ignore. As a result,
please forgive white-space damage.

---

The following changes since commit bfeffd155283772bbe78c6a05dec7c0128ee500c:

  Linux 5.0-rc1 (2019-01-06 17:08:20 -0800)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm tags/devdax-for-5.1

for you to fetch changes up to c221c0b0308fd01d9fb33a16f64d2fd95f8830a4:

  device-dax: "Hotplug" persistent memory for use like normal RAM (2019-02-28 10:41:23 -0800)

----------------------------------------------------------------
device-dax for 5.1
* Replace the /sys/class/dax device model with /sys/bus/dax, and include
  a compat driver so distributions can opt-in to the new ABI.

* Allow for an alternative driver for the device-dax address-range.

* Introduce the 'kmem' driver to hotplug / assign a device-dax
  address-range to the core-mm.

* Arrange for the device-dax target-node to be onlined so that the newly
  added memory range can be uniquely referenced by NUMA APIs.

----------------------------------------------------------------
Dan Williams (11):
      device-dax: Kill dax_region ida
      device-dax: Kill dax_region base
      device-dax: Remove multi-resource infrastructure
      device-dax: Start defining a dax bus model
      device-dax: Introduce bus + driver model
      device-dax: Move resource pinning+mapping into the common driver
      device-dax: Add support for a dax override driver
      device-dax: Add /sys/class/dax backwards compatibility
      acpi/nfit, device-dax: Identify differentiated memory with a unique numa-node
      device-dax: Auto-bind device after successful new_id
      device-dax: Add a 'target_node' attribute

Dave Hansen (5):
      mm/resource: Return real error codes from walk failures
      mm/resource: Move HMM pr_debug() deeper into resource code
      mm/memory-hotplug: Allow memory resources to be children
      mm/resource: Let walk_system_ram_range() search child resources
      device-dax: "Hotplug" persistent memory for use like normal RAM

Vishal Verma (1):
      device-dax: Add a 'modalias' attribute to DAX 'bus' devices

 Documentation/ABI/obsolete/sysfs-class-dax |  22 ++
 arch/powerpc/platforms/pseries/papr_scm.c  |   1 +
 drivers/acpi/nfit/core.c                   |   8 +-
 drivers/acpi/numa.c                        |   1 +
 drivers/base/memory.c                      |   1 +
 drivers/dax/Kconfig                        |  28 +-
 drivers/dax/Makefile                       |   6 +-
 drivers/dax/bus.c                          | 503 +++++++++++++++++++++++++++++
 drivers/dax/bus.h                          |  61 ++++
 drivers/dax/dax-private.h                  |  34 +-
 drivers/dax/dax.h                          |  18 --
 drivers/dax/device-dax.h                   |  25 --
 drivers/dax/device.c                       | 363 +++++----------------
 drivers/dax/kmem.c                         | 108 +++++++
 drivers/dax/pmem.c                         | 153 ---------
 drivers/dax/pmem/Makefile                  |   7 +
 drivers/dax/pmem/compat.c                  |  73 +++++
 drivers/dax/pmem/core.c                    |  71 ++++
 drivers/dax/pmem/pmem.c                    |  40 +++
 drivers/dax/super.c                        |  41 ++-
 drivers/nvdimm/e820.c                      |   1 +
 drivers/nvdimm/nd.h                        |   2 +-
 drivers/nvdimm/of_pmem.c                   |   1 +
 drivers/nvdimm/region_devs.c               |   1 +
 include/linux/acpi.h                       |   5 +
 include/linux/libnvdimm.h                  |   1 +
 kernel/resource.c                          |  18 +-
 mm/memory_hotplug.c                        |  33 +-
 tools/testing/nvdimm/Kbuild                |   7 +-
 tools/testing/nvdimm/dax-dev.c             |  16 +-
 30 files changed, 1112 insertions(+), 537 deletions(-)
 create mode 100644 Documentation/ABI/obsolete/sysfs-class-dax
 create mode 100644 drivers/dax/bus.c
 create mode 100644 drivers/dax/bus.h
 delete mode 100644 drivers/dax/dax.h
 delete mode 100644 drivers/dax/device-dax.h
 create mode 100644 drivers/dax/kmem.c
 delete mode 100644 drivers/dax/pmem.c
 create mode 100644 drivers/dax/pmem/Makefile
 create mode 100644 drivers/dax/pmem/compat.c
 create mode 100644 drivers/dax/pmem/core.c
 create mode 100644 drivers/dax/pmem/pmem.c

Comments

Linus Torvalds March 10, 2019, 8:01 p.m. UTC | #1
On Sun, Mar 10, 2019 at 12:54 PM Dan Williams <dan.j.williams@intel.com> wrote:
>
> Hi Linus, please pull from:
>
>   git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm
> tags/devdax-for-5.1
>
> ...to receive new device-dax infrastructure to allow persistent memory
> and other "reserved" / performance differentiated memories, to be
> assigned to the core-mm as "System RAM".

I'm not pulling this until I get official Intel clarification on the
whole "pmem vs rep movs vs machine check" behavior.

Last I saw it was deadly and didn't work, and we have a whole "mc-safe
memory copy" thing for it in the kernel because repeat string
instructions didn't work correctly on nvmem.

No way am I exposing any users to something like that.

We need a way to know when it works and when it doesn't, and only do
it when it's safe.

                    Linus
Dan Williams March 10, 2019, 11:54 p.m. UTC | #2
[ add Tony, who has wrestled with how to detect rep; movs recoverability ]

On Sun, Mar 10, 2019 at 1:02 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Sun, Mar 10, 2019 at 12:54 PM Dan Williams <dan.j.williams@intel.com> wrote:
> >
> > Hi Linus, please pull from:
> >
> >   git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm
> > tags/devdax-for-5.1
> >
> > ...to receive new device-dax infrastructure to allow persistent memory
> > and other "reserved" / performance differentiated memories, to be
> > assigned to the core-mm as "System RAM".
>
> I'm not pulling this until I get official Intel clarification on the
> whole "pmem vs rep movs vs machine check" behavior.
>
> Last I saw it was deadly and didn't work, and we have a whole "mc-safe
> memory copy" thing for it in the kernel because repeat string
> instructions didn't work correctly on nvmem.
>
> No way am I exposing any users to something like that.
>
> We need a way to know when it works and when it doesn't, and only do
> it when it's safe.

Unfortunately this particular b0rkage is not constrained to nvmem.
I.e. there's nothing specific about nvmem requiring mc-safe memory
copy, it's a cpu problem consuming any poison regardless of
source-media-type with "rep; movs".
Linus Torvalds March 11, 2019, 12:21 a.m. UTC | #3
On Sun, Mar 10, 2019 at 4:54 PM Dan Williams <dan.j.williams@intel.com> wrote:
>
> Unfortunately this particular b0rkage is not constrained to nvmem.
> I.e. there's nothing specific about nvmem requiring mc-safe memory
> copy, it's a cpu problem consuming any poison regardless of
> source-media-type with "rep; movs".

So why is it sold and used for the nvdimm pmem driver?

People told me it was a big deal and machines died.

You can't suddenly change the story just because you want to expose it
to user space.

You can't have it both ways. Either nvdimms have more likelihood of,
and problems with, machine checks, or they don't.

The end result is the same: if intel believes the kernel needs to
treat nvdimms specially, then we're sure as hell not exposing those
snowflakes to user space.

And if intel *doesn't* believe that, then we're removing the mcsafe_* functions.

There's no "oh, it's safe to show to user space, but the kernel is
magical" middle ground here that makes sense to me.

                Linus
Dan Williams March 11, 2019, 3:37 p.m. UTC | #4
On Sun, Mar 10, 2019 at 5:22 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Sun, Mar 10, 2019 at 4:54 PM Dan Williams <dan.j.williams@intel.com> wrote:
> >
> > Unfortunately this particular b0rkage is not constrained to nvmem.
> > I.e. there's nothing specific about nvmem requiring mc-safe memory
> > copy, it's a cpu problem consuming any poison regardless of
> > source-media-type with "rep; movs".
>
> So why is it sold and used for the nvdimm pmem driver?
>
> People told me it was a big deal and machines died.
>
> You can't suddenly change the story just because you want to expose it
> to user space.
>
> You can't have it both ways. Either nvdimms have more likelihood of,
> and problems with, machine checks, or they don't.
>
> The end result is the same: if intel believes the kernel needs to
> treat nvdimms specially, then we're sure as hell not exposing those
> snowflakes to user space.
>
> And if intel *doesn't* believe that, then we're removing the mcsafe_* functions.
>
> There's no "oh, it's safe to show to user space, but the kernel is
> magical" middle ground here that makes sense to me.

I don't think anyone is trying to claim both ways... the mcsafe memcpy
is not implemented because NVDIMMs have a higher chance of
encountering poison, it's implemented because the pmem driver affords
an error model that just isn't possible in other kernel poison
consumption paths. Even if this issue didn't exist there would still
be a rep; mov based mcsafe memcpy for the driver to use on the
expectation that userspace would prefer EIO to a reboot for
kernel-space consumed poison.

That said, I agree with the argument that a kernel mcsafe copy is not
sufficient when DAX is there to arrange for the bulk of
memory-mapped-I/O to be issued from userspace.

Another feature the userspace tooling can support for the PMEM as RAM
case is the ability to complete an Address Range Scrub of the range
before it is added to the core-mm. I.e at least ensure that previously
encountered poison is eliminated. The driver can also publish an
attribute to indicate when rep; mov is recoverable, and gate the
hotplug policy on the result. In my opinion a positive indicator of
the cpu's ability to recover rep; mov exceptions is a gap that needs
addressing.
Linus Torvalds March 12, 2019, 12:07 a.m. UTC | #5
On Mon, Mar 11, 2019 at 8:37 AM Dan Williams <dan.j.williams@intel.com> wrote:
>
> Another feature the userspace tooling can support for the PMEM as RAM
> case is the ability to complete an Address Range Scrub of the range
> before it is added to the core-mm. I.e at least ensure that previously
> encountered poison is eliminated.

Ok, so this at least makes sense as an argument to me.

In the "PMEM as filesystem" part, the errors have long-term history,
while in "PMEM as RAM" the memory may be physically the same thing,
but it doesn't have the history and as such may not be prone to
long-term errors the same way.

So that validly argues that yes, when used as RAM, the likelihood for
errors is much lower because they don't accumulate the same way.

> The driver can also publish an
> attribute to indicate when rep; mov is recoverable, and gate the
> hotplug policy on the result. In my opinion a positive indicator of
> the cpu's ability to recover rep; mov exceptions is a gap that needs
> addressing.

Is there some way to say "don't raise MC for this region"? Or at least
limit it to a nonfatal one?

                 Linus
Dan Williams March 12, 2019, 12:30 a.m. UTC | #6
On Mon, Mar 11, 2019 at 5:08 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Mon, Mar 11, 2019 at 8:37 AM Dan Williams <dan.j.williams@intel.com> wrote:
> >
> > Another feature the userspace tooling can support for the PMEM as RAM
> > case is the ability to complete an Address Range Scrub of the range
> > before it is added to the core-mm. I.e at least ensure that previously
> > encountered poison is eliminated.
>
> Ok, so this at least makes sense as an argument to me.
>
> In the "PMEM as filesystem" part, the errors have long-term history,
> while in "PMEM as RAM" the memory may be physically the same thing,
> but it doesn't have the history and as such may not be prone to
> long-term errors the same way.
>
> So that validly argues that yes, when used as RAM, the likelihood for
> errors is much lower because they don't accumulate the same way.
>
> > The driver can also publish an
> > attribute to indicate when rep; mov is recoverable, and gate the
> > hotplug policy on the result. In my opinion a positive indicator of
> > the cpu's ability to recover rep; mov exceptions is a gap that needs
> > addressing.
>
> Is there some way to say "don't raise MC for this region"? Or at least
> limit it to a nonfatal one?

I wish, but no. Poison consumption always raises the MC; whether
MCI_STATUS_PCC (processor context corrupt) is set then determines
whether the cpu indicates it is safe to proceed. There's no way to
say "never set MCI_STATUS_PCC", or to silence the exception.
Dan Williams March 15, 2019, 5:33 p.m. UTC | #7
On Mon, Mar 11, 2019 at 5:08 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Mon, Mar 11, 2019 at 8:37 AM Dan Williams <dan.j.williams@intel.com> wrote:
> >
> > Another feature the userspace tooling can support for the PMEM as RAM
> > case is the ability to complete an Address Range Scrub of the range
> > before it is added to the core-mm. I.e at least ensure that previously
> > encountered poison is eliminated.
>
> Ok, so this at least makes sense as an argument to me.
>
> In the "PMEM as filesystem" part, the errors have long-term history,
> while in "PMEM as RAM" the memory may be physically the same thing,
> but it doesn't have the history and as such may not be prone to
> long-term errors the same way.
>
> So that validly argues that yes, when used as RAM, the likelihood for
> errors is much lower because they don't accumulate the same way.

Hi Linus,

The question about a new enumeration mechanism for this has been
raised, but I don't expect a response before the merge window closes.
While it percolates, how do you want to proceed in the meantime?

The kernel could export its knowledge of the situation in
/sys/devices/system/cpu/vulnerabilities?

Otherwise, the exposure can be reduced in the volatile-RAM case by
scanning for and clearing errors before the range is onlined as RAM.
The userspace tooling for that can be in place before v5.1-final.
There's also runtime notification of errors via
acpi_nfit_uc_error_notify() from background scrubbers on the DIMM
devices. With that mechanism the kernel could proactively clear newly
discovered poison in the volatile case, but that would be additional
development more suitable for v5.2.

I understand the concern, and the need to highlight this issue by
tapping the brakes on feature development, but I don't see PMEM as RAM
making the situation worse when the exposure is also there via DAX in
the PMEM case. Volatile-RAM is arguably a safer use case since it's
possible to repair pages, whereas the persistent case needs active
application coordination.

Please take another look at merging this for v5.1, or otherwise let me
know what software changes you'd like to see to move this forward. I'm
also open to the idea of just teaching memcpy_mcsafe() to use rep; mov
as if it was always recoverable and relying on the error being mapped
out after reboot if it was not recoverable. At reboot the driver gets
notification of physical addresses that caused a previous crash so
that software can avoid a future consumption.

git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm tags/devdax-for-5.1
pr-tracker-bot@kernel.org March 16, 2019, 9:25 p.m. UTC | #8
The pull request you sent on Sun, 10 Mar 2019 12:54:01 -0700:

> git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm tags/devdax-for-5.1

has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/f67e3fb4891287b8248ebb3320f794b9f5e782d4

Thank you!
Dan Williams May 15, 2019, 8:26 p.m. UTC | #9
On Mon, Mar 11, 2019 at 5:08 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Mon, Mar 11, 2019 at 8:37 AM Dan Williams <dan.j.williams@intel.com> wrote:
> >
> > Another feature the userspace tooling can support for the PMEM as RAM
> > case is the ability to complete an Address Range Scrub of the range
> > before it is added to the core-mm. I.e at least ensure that previously
> > encountered poison is eliminated.
>
> Ok, so this at least makes sense as an argument to me.
>
> In the "PMEM as filesystem" part, the errors have long-term history,
> while in "PMEM as RAM" the memory may be physically the same thing,
> but it doesn't have the history and as such may not be prone to
> long-term errors the same way.
>
> So that validly argues that yes, when used as RAM, the likelihood for
> errors is much lower because they don't accumulate the same way.

In case anyone is looking for the above mentioned tooling for use with
the v5.1 kernel, Vishal has released ndctl-v65 with the new
"clear-errors" command [1].

[1]: https://pmem.io/ndctl/ndctl-clear-errors.html
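A hypothetical invocation of that command is sketched below. The namespace
name is a placeholder and the --scrub option is assumed from the linked
documentation; the command is wrapped in a function so the snippet is
illustrative only, not executed:

```shell
# Hypothetical pre-onlining cleanup with ndctl >= v65. The namespace
# name is a placeholder; see the ndctl-clear-errors man page for the
# authoritative options.
clear_pmem_errors() {
    # Run an Address Range Scrub first, then clear any recorded poison
    # before the range is handed to the kmem driver as System RAM.
    ndctl clear-errors --scrub namespace0.0
}
```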