[RFC,00/18] Pkernfs: Support persistence for live update

Message ID 20240205120203.60312-1-jgowans@amazon.com (mailing list archive)

Gowans, James Feb. 5, 2024, 12:01 p.m. UTC
This RFC is to solicit feedback on the approach of implementing support for live
update via an in-memory filesystem that stores all live update state as files.

Hypervisor live update is a mechanism to support updating a hypervisor via kexec
in a way that has limited impact on running virtual machines. This is done by
pausing/serialising running VMs, kexec-ing into a new kernel, starting new VMM
processes and then deserialising/resuming the VMs so that they continue running
from where they were. Virtual machines can have PCI devices passed through, and
in order to support live update it’s necessary to persist the IOMMU page tables
so that the devices can continue to do DMA to guest RAM during kexec.

This RFC is a follow-on from a discussion held during LPC 2023 KVM MC
which explored ways in which the live update problem could be tackled;
this was one of them:
https://lpc.events/event/17/contributions/1485/

The approach sketched out in this RFC introduces a new in-memory filesystem,
pkernfs. Pkernfs takes ownership of a chunk of system RAM that is carved out
of the normal MM allocator at boot and donated to pkernfs, so this memory is
managed separately from the rest of the Linux memory management system. Files
are created in pkernfs for the things that need to be preserved and
re-hydrated after kexec to support this:

* Guest memory: to be able to restore a VM its memory must be preserved.
This is achieved by using a regular file in pkernfs for guest RAM. As this
guest RAM is not part of the normal Linux core MM allocator and has no
struct pages, it can be removed from the direct map, which improves the
security posture of guest RAM, similar to memfd_secret. (A minimal
userspace sketch of mapping such a file is given after this list.)

* IOMMU root page tables: for the IOMMU to keep translating DMA during
kexec it needs root page tables with which to look up the per-domain page
tables. IOMMU root page tables are stored in a special path in pkernfs:
iommu/root-pgtables. The Intel IOMMU driver is modified to hook into
pkernfs to get the chunk of memory from which it allocates root pgtables.

* IOMMU domain page tables: for VM-initiated DMA to keep running while
kexec is happening, the IOVA-to-PA translations for persisted devices need
to keep working. As with the root pgtables, the per-domain page tables for
persisted devices are allocated from a pkernfs file so that they too are
persisted across kexec. Not all devices are persistent, so VFIO is updated
to support defining persistent page tables on passed-through devices.

* Updates to the IOMMU and PCI subsystems are needed to make device
handover across kexec work properly. Although not fully complete, some of
the changes needed around avoiding device resets and re-probing are
sketched in this RFC.
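
As a minimal sketch of how a VMM could consume such a guest RAM file: the
file creation and sizing steps are shown in the "How to use" section below,
and the mmap path assumed here is the mmap callback added by this series.
Everything else (sizes, paths) is purely illustrative.

  #include <fcntl.h>
  #include <stdio.h>
  #include <sys/mman.h>
  #include <unistd.h>

  int main(void)
  {
          size_t size = 100 << 20;  /* 100 MiB of guest RAM */
          int fd = open("/mnt/pkernfs/guest-ram", O_RDWR);

          if (fd < 0 || ftruncate(fd, size) < 0) {
                  perror("guest-ram");
                  return 1;
          }
          /* MAP_SHARED: the same persistent pages can be re-mapped after kexec. */
          void *ram = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
          if (ram == MAP_FAILED) {
                  perror("mmap");
                  return 1;
          }
          /* ... register 'ram' as a KVM memslot and run the VM ... */
          munmap(ram, size);
          close(fd);
          return 0;
  }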

Guest RAM and IOMMU state are just the first two things needed for live update.
Pkernfs opens the door to persisting, as new files, other kernel state which can
improve kexec or add more capabilities to live update.

The main aspect we’re looking for feedback/opinions on here is the concept of
putting all persistent state in a single filesystem: combining guest RAM and
IOMMU pgtables in one store. Also, the question of a hard separation between
persistent memory and ephemeral memory, compared to allowing arbitrary pages to
be persisted. Pkernfs does it via a hard separation defined at boot time, other
approaches could make the carving out of persistent pages dynamic.

Sign-offs are intentionally omitted to make it clear that this is a
concept sketch RFC and not intended for merging.

On CC are folks who have sent RFCs around this problem space before, as
well as filesystem, kexec, IOMMU, MM and KVM lists and maintainers.

== Alternatives ==

There have been other RFCs which cover some aspect of the live update problem
space. So far, all public approaches with KVM have neglected device assignment,
which introduces a new dimension of problems. Prior art in this space includes:

1) Kexec Hand Over (KHO) [0]: This is a generic mechanism to pass kernel state
across kexec. It also supports specifying persisted memory pages which could be
used to carve out IOMMU pgtable pages from the new kernel’s buddy allocator.

2) PKRAM [1]: A tmpfs-style filesystem which dynamically allocates memory that
can be used for guest RAM and is preserved across kexec by passing a pointer to
the root page.

3) DMEMFS [2]: Similar to pkernfs, DMEMFS is a filesystem on top of a reserved
chunk of memory specified via kernel cmdline parameter. It is not persistent but
aims to remove the need for struct page overhead.

4) Kernel memory pools [3, 4]: These provide a mechanism for kernel modules/drivers
to allocate persistent memory, and restore that memory after kexec. They do not
attempt to provide the ability to store userspace-accessible state or to have a
filesystem interface.

== How to use ==

Use the memmap and pkernfs cmdline args to carve memory out of system RAM and
donate it to pkernfs. For example, to carve out 1 GiB of RAM starting at physical
offset 1 GiB:
  memmap=1G%1G nopat pkernfs=1G!1G

Mount pkernfs somewhere, for example:
  mount -t pkernfs /mnt/pkernfs

Allocate a file for guest RAM:
  touch /mnt/pkernfs/guest-ram
  truncate -s 100M /mnt/pkernfs/guest-ram

Add QEMU cmdline option to use this as guest RAM:
  -object memory-backend-file,id=pc.ram,size=100M,mem-path=/mnt/pkernfs/guest-ram,share=yes
  -M q35,memory-backend=pc.ram

Allocate a file for IOMMU domain page tables:
  touch /mnt/pkernfs/iommu/dom-0
  truncate -s 2M /mnt/pkernfs/iommu/dom-0

That file must be supplied to VFIO when creating the IOMMU container, via the
VFIO_CONTAINER_SET_PERSISTENT_PGTABLES ioctl. Example: [5]
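
Roughly, userspace would hand the pkernfs file to the container as in the
sketch below. This is an assumption based on the description above: the ioctl
is added by this series (so it is not in stock uapi headers), and the exact
argument layout may differ from the file-descriptor form shown here.

  #include <fcntl.h>
  #include <stdio.h>
  #include <sys/ioctl.h>
  #include <linux/vfio.h>

  /* Sketch: point the container's IOMMU page table allocations at pkernfs. */
  int set_persistent_pgtables(int container_fd)
  {
          int pgtables_fd = open("/mnt/pkernfs/iommu/dom-0", O_RDWR);

          if (pgtables_fd < 0) {
                  perror("open persistent pgtables file");
                  return -1;
          }
          if (ioctl(container_fd, VFIO_CONTAINER_SET_PERSISTENT_PGTABLES,
                    &pgtables_fd) < 0) {
                  perror("VFIO_CONTAINER_SET_PERSISTENT_PGTABLES");
                  return -1;
          }
          return 0;
  }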

After kexec, re-mount pkernfs and re-use those files for guest RAM and IOMMU
state. When doing DMA mapping, specify the additional flag
VFIO_DMA_MAP_FLAG_LIVE_UPDATE to indicate that the IOVAs are already set up.
Example: [6].
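
A sketch of the post-kexec MAP_DMA call with that flag. Again this assumes
uapi headers from a kernel with this series applied (the flag is new), and
the comment describes the intended semantics per this RFC:

  #include <stdint.h>
  #include <sys/ioctl.h>
  #include <linux/vfio.h>

  /*
   * Re-register an existing IOVA mapping after live update. The
   * LIVE_UPDATE flag tells the type1 driver that the translation is
   * already present in the persistent page tables, so it should not
   * write the IO page tables again.
   */
  int remap_after_live_update(int container_fd, void *vaddr,
                              uint64_t iova, uint64_t size)
  {
          struct vfio_iommu_type1_dma_map map = {
                  .argsz = sizeof(map),
                  .flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE |
                           VFIO_DMA_MAP_FLAG_LIVE_UPDATE,
                  .vaddr = (uintptr_t)vaddr,
                  .iova = iova,
                  .size = size,
          };

          return ioctl(container_fd, VFIO_IOMMU_MAP_DMA, &map);
  }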

== Limitations ==

This is an RFC design to sketch out the concept so that there can be a discussion
about the general approach. There are many gaps and hacks; the idea is to keep
this RFC as simple as possible. Limitations include:

* Needing to supply the physical memory range for pkernfs as a kernel cmdline
parameter. Better would be to allocate memory for pkernfs dynamically on first
boot and pass that across kexec. Doing so would require additional integration
with memblocks and some ability to pass the dynamically allocated ranges
across. KHO [0] could support this.

* A single filesystem with no support for NUMA awareness. Better would be to
support multiple named pkernfs mounts which can cover different NUMA nodes.

* Skeletal filesystem code. There’s just enough functionality to demonstrate the
concept of using files for guest RAM and IOMMU state.

* Use-after-frees for IOMMU mappings. Currently nothing stops the pkernfs guest
RAM files from being deleted or resized while IOMMU mappings are set up, which
would allow DMA to freed memory. Better integration between guest RAM files and
IOMMU/VFIO is necessary.

* Needing to drive and re-hydrate the IOMMU page tables by defining an IOMMU file.
Really we should move the abstraction one level up and make the whole VFIO
container persistent via a pkernfs file. That way you’d "just" re-open the VFIO
container file and all of the DMA mappings inside VFIO would already be set up.

* Inefficient use of filesystem space. Every mapping block is 2 MiB, which is both
wasteful and a hard upper limit on file size.

[0] https://lore.kernel.org/kexec/20231213000452.88295-1-graf@amazon.com/
[1] https://lore.kernel.org/kexec/1682554137-13938-1-git-send-email-anthony.yznaga@oracle.com/
[2] https://lkml.org/lkml/2020/12/7/342
[3] https://lore.kernel.org/all/169645773092.11424.7258549771090599226.stgit@skinsburskii./
[4] https://lore.kernel.org/all/2023082506-enchanted-tripping-d1d5@gregkh/#t
[5] https://github.com/jgowans/qemu/commit/e84cfb8186d71f797ef1f72d57d873222a9b479e
[6] https://github.com/jgowans/qemu/commit/6e4f17f703eaf2a6f1e4cb2576d61683eaee02b0


James Gowans (18):
  pkernfs: Introduce filesystem skeleton
  pkernfs: Add persistent inodes hooked into directories
  pkernfs: Define an allocator for persistent pages
  pkernfs: support file truncation
  pkernfs: add file mmap callback
  init: Add liveupdate cmdline param
  pkernfs: Add file type for IOMMU root pgtables
  iommu: Add allocator for pgtables from persistent region
  intel-iommu: Use pkernfs for root/context pgtable pages
  iommu/intel: zap context table entries on kexec
  dma-iommu: Always enable deferred attaches for liveupdate
  pkernfs: Add IOMMU domain pgtables file
  vfio: add ioctl to define persistent pgtables on container
  intel-iommu: Allocate domain pgtable pages from pkernfs
  pkernfs: register device memory for IOMMU domain pgtables
  vfio: support not mapping IOMMU pgtables on live-update
  pci: Don't clear bus master if persistence enabled
  vfio-pci: Assume device working after liveupdate

 drivers/iommu/Makefile           |   1 +
 drivers/iommu/dma-iommu.c        |   2 +-
 drivers/iommu/intel/dmar.c       |   1 +
 drivers/iommu/intel/iommu.c      |  93 +++++++++++++---
 drivers/iommu/intel/iommu.h      |   5 +
 drivers/iommu/iommu.c            |  22 ++--
 drivers/iommu/pgtable_alloc.c    |  43 +++++++
 drivers/iommu/pgtable_alloc.h    |  10 ++
 drivers/pci/pci-driver.c         |   4 +-
 drivers/vfio/container.c         |  27 +++++
 drivers/vfio/pci/vfio_pci_core.c |  20 ++--
 drivers/vfio/vfio.h              |   2 +
 drivers/vfio/vfio_iommu_type1.c  |  51 ++++++---
 fs/Kconfig                       |   1 +
 fs/Makefile                      |   3 +
 fs/pkernfs/Kconfig               |   9 ++
 fs/pkernfs/Makefile              |   6 +
 fs/pkernfs/allocator.c           |  51 +++++++++
 fs/pkernfs/dir.c                 |  43 +++++++
 fs/pkernfs/file.c                |  93 ++++++++++++++++
 fs/pkernfs/inode.c               | 185 +++++++++++++++++++++++++++++++
 fs/pkernfs/iommu.c               | 163 +++++++++++++++++++++++++++
 fs/pkernfs/pkernfs.c             | 115 +++++++++++++++++++
 fs/pkernfs/pkernfs.h             |  61 ++++++++++
 include/linux/init.h             |   1 +
 include/linux/iommu.h            |   6 +-
 include/linux/pkernfs.h          |  38 +++++++
 include/uapi/linux/vfio.h        |  10 ++
 init/main.c                      |  10 ++
 29 files changed, 1029 insertions(+), 47 deletions(-)
 create mode 100644 drivers/iommu/pgtable_alloc.c
 create mode 100644 drivers/iommu/pgtable_alloc.h
 create mode 100644 fs/pkernfs/Kconfig
 create mode 100644 fs/pkernfs/Makefile
 create mode 100644 fs/pkernfs/allocator.c
 create mode 100644 fs/pkernfs/dir.c
 create mode 100644 fs/pkernfs/file.c
 create mode 100644 fs/pkernfs/inode.c
 create mode 100644 fs/pkernfs/iommu.c
 create mode 100644 fs/pkernfs/pkernfs.c
 create mode 100644 fs/pkernfs/pkernfs.h
 create mode 100644 include/linux/pkernfs.h

Comments

Alex Williamson Feb. 5, 2024, 5:10 p.m. UTC | #1
On Mon, 5 Feb 2024 12:01:45 +0000
James Gowans <jgowans@amazon.com> wrote:

> [...]
>
> * Needing to drive and re-hydrate the IOMMU page tables by defining an IOMMU file.
> Really we should move the abstraction one level up and make the whole VFIO
> container persistent via a pkernfs file. That way you’d "just" re-open the VFIO
> container file and all of the DMA mappings inside VFIO would already be set up.

Note that the vfio container is on a path towards deprecation, this
should be refocused on vfio relative to iommufd.  There would need to
be a strong argument for a container/type1 extension to support this,
iommufd would need to be the first class implementation.  Thanks,

Alex
 
Jason Gunthorpe Feb. 5, 2024, 5:42 p.m. UTC | #2
On Mon, Feb 05, 2024 at 12:01:45PM +0000, James Gowans wrote:

> The main aspect we’re looking for feedback/opinions on here is the concept of
> putting all persistent state in a single filesystem: combining guest RAM and
> IOMMU pgtables in one store. Also, the question of a hard separation between
> persistent memory and ephemeral memory, compared to allowing arbitrary pages to
> be persisted. Pkernfs does it via a hard separation defined at boot time, other
> approaches could make the carving out of persistent pages dynamic.

I think if you are going to attempt something like this then the end
result must bring things back to having the same data structures fully
restored.

It is fine that the pkernfs holds some persistent memory that
guarantees the IOMMU can remain programmed and the VM pages can become
fixed across the kexec.

But once the VMM starts to restore itself we need to get back to the
original configuration:
 - A mmap that points to the VM's physical pages
 - An iommufd IOAS that points to the above mmap
 - An iommufd HWPT that represents that same mapping
 - An iommu_domain programmed into HW that backs the HWPT

Ie you can't just reboot and leave the IOMMU hanging out in some
undefined land - especially in latest kernels!

For vt-d you need to retain the entire root table and all the required
context entries too. The restarting iommu needs to understand that it
has to "restore" a temporary iommu_domain from the pkernfs.

You can later reconstitute a proper iommu_domain from the VMM and
atomic switch.

So, I'm surprised to see this approach where things just live forever
in the kernfs, I don't see how "restore" is going to work very well
like this.

I would think that a save/restore mentality would make more
sense. For instance you could make a special iommu_domain that is fixed
and lives in the pkernfs. The operation would be to copy from the live
iommu_domain to the fixed one and then replace the iommu HW to the
fixed one.

In the post-kexec world the iommu would recreate that special domain
and point the iommu at it. (copying the root and context descriptions
out of the pkernfs). Then somehow that would get into iommufd and VFIO
so that it could take over that special mapping during its startup.

Then you'd build the normal operating ioas and hwpt (with all the
right page refcounts/etc) then switch to it and free the pkernfs
memory.

It seems a lot less invasive to me. The special case is clearly a
special case and doesn't mess up the normal operation of the drivers.

It becomes more like kdump where the iommu driver is running in a
fairly normal mode, just with some stuff copied from the prior kernel.

Your text spent a lot of time talking about the design of how the pages
persist, which is interesting, but it seems like only a small part of
the problem. Actually using that mechanism in a sane way and covering
all the functional issues in the HW drivers is going to be really
challenging.

> * Needing to drive and re-hydrate the IOMMU page tables by defining an IOMMU file.
> Really we should move the abstraction one level up and make the whole VFIO
> container persistent via a pkernfs file. That way you’d "just" re-open the VFIO
> container file and all of the DMA mappings inside VFIO would already be set up.

I doubt this.. It probably needs to be much finer grained actually,
otherwise you are going to be serializing everything. Somehow I think
you are better to serialize a minimum and try to reconstruct
everything else in userspace. Like conserving iommufd IDs would be a
huge PITA.

There are also going to be lots of security questions here, like we
can't just let userspace feed in any garbage and violate vfio and
iommu invariants.

Jason
Luca Boccassi Feb. 6, 2024, 2:55 p.m. UTC | #3
> Also, the question of a hard separation between persistent memory and
> ephemeral memory, compared to allowing arbitrary pages to be persisted.
> Pkernfs does it via a hard separation defined at boot time, other
> approaches could make the carving out of persistent pages dynamic.

Speaking from experience here - in Azure (Boost) we have been using
hard-carved out memory areas (DAX devices with ranges configured via
DTB) for persisting state across kexec for ~5 years or so. In a
nutshell: don't, it's a mistake.

It's a constant and consistent source of problems, headaches, issues
and workarounds piled upon workarounds, held together with duct tape
and prayers. It's just not flexible enough for any modern system. For
example, unless _all_ the machines are ridiculously overprovisioned in
terms of memory capacity (and guaranteed to remain so, forever), you
end up wasting enormous amounts of memory.

In Azure we are very much interested in a nice, well-abstracted,
first-class replacement for that setup that allows persisting data
across kexec, and in systemd userspace we'd very much want to use it as
well, but it really, really needs to be dynamic, and avoid the pitfall
of a hard-configured carved-out chunk.
Gowans, James Feb. 7, 2024, 2:45 p.m. UTC | #4
Hi Jason,

Thanks for this great feedback on the approach - it's exactly the sort
of thing we were looking for.

On Mon, 2024-02-05 at 13:42 -0400, Jason Gunthorpe wrote:
> 
> On Mon, Feb 05, 2024 at 12:01:45PM +0000, James Gowans wrote:
> 
> > The main aspect we’re looking for feedback/opinions on here is the concept of
> > putting all persistent state in a single filesystem: combining guest RAM and
> > IOMMU pgtables in one store. Also, the question of a hard separation between
> > persistent memory and ephemeral memory, compared to allowing arbitrary pages to
> > be persisted. Pkernfs does it via a hard separation defined at boot time, other
> > approaches could make the carving out of persistent pages dynamic.
> 
> I think if you are going to attempt something like this then the end
> result must bring things back to having the same data structures fully
> restored.
> 
> It is fine that the pkernfs holds some persistent memory that
> guarantees the IOMMU can remain programmed and the VM pages can become
> fixed across the kexec.
> 
> But once the VMM starts to restore itself we need to get back to the
> original configuration:
>  - A mmap that points to the VM's physical pages
>  - An iommufd IOAS that points to the above mmap
>  - An iommufd HWPT that represents that same mapping
>  - An iommu_domain programmed into HW that backs the HWPT

(A quick note on iommufd vs VFIO: I'll still keep referring to VFIO for
now because that's what I know, but will explore iommufd more and reply
in more detail about iommufd in the other email thread.)

How much of this do you think should be done automatically, vs how much
should userspace need to drive? With this RFC userspace basically re-
drives everything, including re-injecting the file containing the
persistent page tables into the IOMMU domain via VFIO.

Part of the reason is simplicity, to avoid having auto-deserialise code
paths in the drivers and modules. Another part of the reason is so that
userspace can get FD handles on the resources. Typically FDs are
returned by doing actions like creating VFIO containers. If we make all
that automatic then there needs to be some other mechanism for
auto-restored resources to present themselves to userspace so that
userspace can discover and pick them up.

One possible way to do this would be to populate a bunch of files in
procfs for each persisted IOMMU domain that allows userspace to discover
and pick it up.

Can you expand on what you mean by "A mmap that points to the VM's
physical pages?" Are you suggesting that the QEMU process automatically
gets something appearing in its address space? Part of the live update
process involves potentially changing the userspace binaries: doing
kexec and booting a new system is an opportunity to boot new versions of
the userspace binary. So we shouldn't try to preserve too much of
userspace state; it's better to let it re-create internal data
structures and do fresh mmaps.

What I'm really asking is: do you have a specific suggestion about how
these persistent resources should present themselves to userspace and
how userspace can discover them and pick them up?

> 
> Ie you can't just reboot and leave the IOMMU hanging out in some
> undefined land - especially in latest kernels!

Not too sure what you mean by "undefined land" - the idea is that the
IOMMU keeps doing what it was doing until userspace comes along and
re-creates the handles to the IOMMU, at which point it can do
modifications like change mappings or tear the domain down. This is what
deferred attach gives us, I believe, and why I had to change it to be
enabled. Just leave the IOMMU domain alone until userspace re-creates it
with the original tables.
Perhaps I'm missing your point. :-)

> 
> For vt-d you need to retain the entire root table and all the required
> context entries too, The restarting iommu needs to understand that it
> has to "restore" a temporary iommu_domain from the pkernfs.
> You can later reconstitute a proper iommu_domain from the VMM and
> atomic switch.

Why does it need to go via a temporary domain? The current approach is
just to leave the IOMMU domain running as-is via deferred attach, and
later when userspace starts up it will create the iommu_domain backed by
the same persistent page tables.
> 
> So, I'm surprised to see this approach where things just live forever
> in the kernfs, I don't see how "restore" is going to work very well
> like this.

Can you expand on why the suggested restore path will be problematic? In
summary the idea is to re-create all of the "ephemeral" data structures
by re-doing ioctls like MAP_DMA, but keeping the persistent IOMMU
root/context tables pointed at the original persistent page tables. The
ephemeral data structures are re-created in userspace but the persistent
page tables left alone. This is of course dependent on userspace re-
creating things *correctly* - it can easily do the wrong thing. Perhaps
this is the issue? Or is there a problem even if userspace is sane.

> I would think that a save/restore mentality would make more
> sense. For instance you could make a special iommu_domain that is fixed
> and lives in the pkernfs. The operation would be to copy from the live
> iommu_domain to the fixed one and then replace the iommu HW to the
> fixed one.
> 
> In the post-kexec world the iommu would recreate that special domain
> and point the iommu at it. (copying the root and context descriptions
> out of the pkernfs). Then somehow that would get into iommufd and VFIO
> so that it could take over that special mapping during its startup.

The save and restore model is super interesting - I'm keen to discuss
this as an alternative. You're suggesting that the IOMMU driver have a
serialise phase just before kexec where it dumps everything into
persistent memory and then after kexec pulls it back into ephemeral
memory. That's probably do-able, but it may increase the critical
section latency of live update (every millisecond counts!) and I'm also
not too sure what that buys compared to always working with persistent
memory, where the persistent data is always in use and can be picked up
as-is.

However, the idea of a serialise and deserialise operation is relevant
to a possible alternative to this RFC. My colleague Alex Graf is working
on a framework called Kexec Hand Over (KHO):
https://lore.kernel.org/all/20240117144704.602-1-graf@amazon.com/#r
That allows drivers/modules to mark arbitrary memory pages as persistent
(ie: not allocatable by the next kernel) and to pass some serialised
state across kexec.
An alternative to IOMMU domain persistence could be to use KHO to mark
the IOMMU root, context and domain page table pages as persistent.

> 
> Then you'd build the normal operating ioas and hwpt (with all the
> right page refcounts/etc) then switch to it and free the pkernfs
> memory.
> 
> It seems alot less invasive to me. The special case is clearly a
> special case and doesn't mess up the normal operation of the drivers.

Yes, a serialise/deserialise approach does have this distinct advantage
of not needing to change the alloc/free code paths. Pkernfs requires a
shim in the allocator to use persistent memory.


> 
> > * Needing to drive and re-hydrate the IOMMU page tables by defining an IOMMU file.
> > Really we should move the abstraction one level up and make the whole VFIO
> > container persistent via a pkernfs file. That way you’d "just" re-open the VFIO
> > container file and all of the DMA mappings inside VFIO would already be set up.
> 
> I doubt this.. It probably needs to be much finer grained actually,
> otherwise you are going to be serializing everything. Somehow I think
> you are better to serialize a minimum and try to reconstruct
> everything else in userspace. Like conserving iommufd IDs would be a
> huge PITA.
> 
> There are also going to be lots of security questions here, like we
> can't just let userspace feed in any garbage and violate vfio and
> iommu invariants.

Right! This is definitely one of the big gaps at the moment: this
approach requires that VFIO has the same state re-driven into it from
userspace so that the persistent and ephemeral data match. If userspace
does something dodgy, well, it may cause problems. :-)
That's exactly why I thought we should move the abstraction up to a
level that doesn't depend on userspace re-driving data. It sounds like
you were suggesting similar in the first part of your comment, but I
didn't fully understand how you'd like to see it presented to userspace.

JG
Gowans, James Feb. 7, 2024, 2:56 p.m. UTC | #5
On Mon, 2024-02-05 at 10:10 -0700, Alex Williamson wrote:
> > * Needing to drive and re-hydrate the IOMMU page tables by defining
> > an IOMMU file.
> > Really we should move the abstraction one level up and make the
> > whole VFIO
> > container persistent via a pkernfs file. That way you’d "just" re-
> > open the VFIO
> > container file and all of the DMA mappings inside VFIO would already
> > be set up.
> 
> Note that the vfio container is on a path towards deprecation, this
> should be refocused on vfio relative to iommufd.  There would need to
> be a strong argument for a container/type1 extension to support this,
> iommufd would need to be the first class implementation.  Thanks,

Ack! When I first started putting pkernfs together, iommufd wasn't
integrated into QEMU yet, hence I stuck with VFIO for this PoC.
I'm thrilled to see that iommufd now seems to be integrated in QEMU!
Good opportunity to get to grips with it.

The VFIO-specific part of this patch is essentially ioctls on the
*container* to be able to:

1. define persistent page tables (PPTs) on the containers so that those
PPTs are used by the IOMMU domain and hence by all devices added to that
container.
https://github.com/jgowans/qemu/commit/e84cfb8186d71f797ef1f72d57d873222a9b479e

2. Tell VFIO to avoid mapping the memory in again after live update
because it already exists.
https://github.com/jgowans/qemu/commit/6e4f17f703eaf2a6f1e4cb2576d61683eaee02b0
(the above flag should only be set *after* live update...).

Do you have a rough suggestion about how similar could be done with
iommufd?

JG
Jason Gunthorpe Feb. 7, 2024, 3:22 p.m. UTC | #6
On Wed, Feb 07, 2024 at 02:45:42PM +0000, Gowans, James wrote:
> Hi Jason,
> 
> Thanks for this great feedback on the approach - it's exactly the sort
> of thing we were looking for.
> 
> On Mon, 2024-02-05 at 13:42 -0400, Jason Gunthorpe wrote:
> > 
> > On Mon, Feb 05, 2024 at 12:01:45PM +0000, James Gowans wrote:
> > 
> > > The main aspect we’re looking for feedback/opinions on here is the concept of
> > > putting all persistent state in a single filesystem: combining guest RAM and
> > > IOMMU pgtables in one store. Also, the question of a hard separation between
> > > persistent memory and ephemeral memory, compared to allowing arbitrary pages to
> > > be persisted. Pkernfs does it via a hard separation defined at boot time, other
> > > approaches could make the carving out of persistent pages dynamic.
> > 
> > I think if you are going to attempt something like this then the end
> > result must bring things back to having the same data structures fully
> > restored.
> > 
> > It is fine that the pkernfs holds some persistent memory that
> > guarantees the IOMMU can remain programmed and the VM pages can become
> > fixed across the kexec.
> > 
> > But once the VMM starts to restore itself we need to get back to the
> > original configuration:
> >  - A mmap that points to the VM's physical pages
> >  - An iommufd IOAS that points to the above mmap
> >  - An iommufd HWPT that represents that same mapping
> >  - An iommu_domain programmed into HW that backs the HWPT
> 
> (A quick note on iommufd vs VFIO: I'll still keep referring to VFIO for
> now because that's what I know, but will explore iommufd more and reply
> in more detail about iommufd in the other email thread.)
> 
> How much of this do you think should be done automatically, vs how much
> should userspace need to drive? With this RFC userspace basically re-
> drives everything, including re-injecting the file containing the
> persistent page tables into the IOMMU domain via VFIO.

My guess is that fully automatically is hard/impossible as there is
lots and lots of related state that has to come back. Like how do you
get all the internal iommufd IOAS-related data structures back
automatically? Seems way too hard.

Feels simpler to have userspace redo whatever setup was needed to get
back to the right spot.

> Part of the reason is simplicity, to avoid having auto-deserialise code
> paths in the drivers and modules. Another part of the reason so that
> userspace can get FD handles on the resources. Typically FDs are
> returned by doing actions like creating VFIO containers. If we make all
> that automatic then there needs to be some other mechanism for auto-
> restored resources to present themselves to userspace so that userspace
> can discover and pick them up again.

Right, there is lots of state all over the place that would be hard to
just re-materialize.

> Can you expand on what you mean by "A mmap that points to the VM's
> physical pages?" Are you suggesting that the QEMU process automatically
> gets something appearing in it's address space? Part of the live update
> process involves potentially changing the userspace binaries: doing
> kexec and booting a new system is an opportunity to boot new versions of
> the userspace binary. So we shouldn't try to preserve too much of
> userspace state; it's better to let it re-create internal data
> structures do fresh mmaps.

I expect the basic flow would be like:

 Starting kernel
   - run the VMM
   - Allocate VM memory in the pkernfs
   - mmap that VM memory
   - Attach the VM memory to KVM
   - Attach the VM memory mmap to IOMMUFD
   - Operate the VM

 Suspending the kernel
   - Stop touching iommufd
   - Freeze changes to the IOMMU, and move its working memory to pkernfs
   - Exit the kernel

 New kernel
   - Recover the frozen IOMMU back to partially running, like crash
     dump. Continue to use some of the working memory in the pkernfs
   - run the new VMM. Some IOMMU_DOMAIN_PKERNFS thing to represent
     this state
   - mmap the VM memory
   - Get KVM going again
   - Attach the new VMM's VM memory mmap to IOMMUFD
   - Replace the iommu partial configuration with a full configuration
   - Free the pkernfs iommu related memory

> What I'm really asking is: do you have a specific suggestion about how
> these persistent resources should present themselves to userspace and
> how userspace can discover them and pick them up?

The only tricky bit in the above is having VFIO know it should leave
the iommu and PCI device state alone when the VFIO cdev is first
opened. Otherwise everything else is straightforward.

Presumably vfio would know it inherited a pkernfs blob and would do
the right stuff. May be some uAPI fussing there to handshake that
properly

Once VFIO knows this it can operate iommufd to conserve the
IOMMU_DOMAIN_PKERNFS as well.

> > Ie you can't just reboot and leave the IOMMU hanging out in some
> > undefined land - especially in latest kernels!
>
> Not too sure what you mean by "undefined land" - the idea is that the
> IOMMU keeps doing what it was going until userspace comes along re-

In terms of how the iommu subsystem understands what the iommu is
doing. The iommu subsystem now forces the iommu into defined states as
part of its startup, and you need an explicit, defined state which means
"continuing to use the pkernfs saved state", which the iommu driver
deliberately enters.

> creates the handles to the IOMMU at which point it can do modifications
> like change mappings or tear the domain down. This is what deferred
> attached gives us, I believe, and why I had to change it to be
> enabled.

VFIO doesn't trigger deferred attach at all, that patch made no sense.

> > For vt-d you need to retain the entire root table and all the required
> > context entries too, The restarting iommu needs to understand that it
> > has to "restore" a temporary iommu_domain from the pkernfs.
> > You can later reconstitute a proper iommu_domain from the VMM and
> > atomic switch.
> 
> Why does it need to go via a temporary domain? 

Because that is the software model we have now. You must be explicit,
not in some la-la undefined land of "I don't know WTF is going on but
if I squint this is doing some special thing!" That concept is dead in
the iommu subsystem; you must be explicit.

If the iommu is translating through special page tables stored in a
pkernfs then you need an IOMMU_DOMAIN_PKERNFS to represent that
behavior.

> > So, I'm surprised to see this approach where things just live forever
> > in the kernfs, I don't see how "restore" is going to work very well
> > like this.
> 
> Can you expand on why the suggested restore path will be problematic? In
> summary the idea is to re-create all of the "ephemeral" data structures
> by re-doing ioctls like MAP_DMA, but keeping the persistent IOMMU
> root/context tables pointed at the original persistent page tables. The
> ephemeral data structures are re-created in userspace but the persistent
> page tables left alone. This is of course dependent on userspace re-
> creating things *correctly* - it can easily do the wrong thing. Perhaps
> this is the issue? Or is there a problem even if userspace is sane.

Because how do you regain control of the iommu in a fully configured
way with all the right pin counts and so forth? It seems impossible
like this, all that information is washed away during the kexec.

It seems easier if the pkernfs version of the iommu configuration is
temporary and very special. The normal working mode is just exactly as
today.

> > I would think that a save/restore mentality would make more
> > sense. For instance you could make a special iommu_domain that is fixed
> > and lives in the pkernfs. The operation would be to copy from the live
> > iommu_domain to the fixed one and then replace the iommu HW to the
> > fixed one.
> > 
> > In the post-kexec world the iommu would recreate that special domain
> > and point the iommu at it. (copying the root and context descriptions
> > out of the pkernfs). Then somehow that would get into iommufd and VFIO
> > so that it could take over that special mapping during its startup.
> 
> The save and restore model is super interesting - I'm keen to discuss
> this as an alternative. You're suggesting that IOMMU driver have a
> serialise phase just before kexec where it dumps everything into
> persistent memory and then after kexec pulls it back into ephemeral
> memory. That's probably do-able, but it may increase the critical
> section latency of live update (every millisecond counts!) 

Suspend preparation can be done before stopping the vCPUs. You have
to commit to freezing the iommu, which only means things like memory
hotplug can't progress. So it isn't critical path.

Same on resume, you can resume KVM and the vCPUs and leave the IOMMU in
its suspended state while you work on returning it to normal
operation. Again only memory hotplug becomes blocked, so it isn't
critical path.

> and I'm also not too sure what that buys compared to always working
> with persistent memory and just always being in a state where
> persistent data is always being used and can be picked up as-is.

You don't mess up the entire driver and all of its memory management,
and end up with a problem where you can't actually properly restore it
anyhow :)

> However, the idea of a serialise and deserialise operation is relevant
> to a possible alternative to this RFC. My colleague Alex Graf is working
> on a framework called Kexec Hand Over (KHO):
> https://lore.kernel.org/all/20240117144704.602-1-graf@amazon.com/#r
> That allows drivers/modules to mark arbitrary memory pages as persistent
> (ie: not allocatable by next kernel) and to pass over some serialised
> state across kexec.
> An alternative to IOMMU domain persistence could be to use KHO to mark
> the IOMMU root, context and domain page table pages as persistent via
> KHO.

IMHO it doesn't matter how you get the memory across the kexec, you
still have to answer all these questions about how the new kernel
actually keeps working with this inherited data, and how it transforms
the inherited data into operating data that is properly situated in the
kernel data structures.

You can't just startup iommufd and point it at a set of io page tables
that something else populated. It is fundamentally wrong and would
lead to corrupting the mm's pin counts.

> > > * Needing to drive and re-hydrate the IOMMU page tables by defining an IOMMU file.
> > > Really we should move the abstraction one level up and make the whole VFIO
> > > container persistent via a pkernfs file. That way you’d "just" re-open the VFIO
> > > container file and all of the DMA mappings inside VFIO would already be set up.
> > 
> > I doubt this.. It probably needs to be much finer grained actually,
> > otherwise you are going to be serializing everything. Somehow I think
> > you are better to serialize a minimum and try to reconstruct
> > everything else in userspace. Like conserving iommufd IDs would be a
> > huge PITA.
> > 
> > There are also going to be lots of security questions here, like we
> > can't just let userspace feed in any garbage and violate vfio and
> > iommu invariants.
> 
> Right! This is definitely one of the big gaps at the moment: this
> approach requires that VFIO has the same state re-driven into it from
> userspace so that the persistent and ephemeral data match. If userspace
> does something dodgy, well, it may cause problems. :-)
> That's exactly why I thought we should move the abstraction up to a
> level that doesn't depend on userspace re-driving data. It sounds like
> you were suggesting similar in the first part of your comment, but I
> didn't fully understand how you'd like to see it presented to userspace.

I'd think you end up with some scenario where the pkernfs data has to
be trusted and sealed somehow before vfio would understand it. Ie you
have to feed it into vfio/etc via kexec only.

From a security perspective it does seem horribly wrong to expose such
sensitive data in a filesystem API where there are API surfaces that
would let userspace manipulate it.

At least from the iommu/vfio perspective:

The trusted data should originate inside a signed kernel only.

The signed kernel should prevent userspace from reading or writing it

The next kernel should trust that the prior kernel put the correct
data in there. There should be no option for the next kernel userspace
to read or write the data.

The next kernel can automatically affiliate things with the trusted
inherited data that it knows was passed from the prior signed kernel.
eg autocreate a IOMMU_DOMAIN_PKERNFS, tweak VFIO, etc.

I understand the appeal of making a pkernfs to hold the VM's memory
pages, but it doesn't seem so secure for kernel internal data
structures.

Jason
Jason Gunthorpe Feb. 7, 2024, 3:28 p.m. UTC | #7
On Wed, Feb 07, 2024 at 02:56:33PM +0000, Gowans, James wrote:
> 2. Tell VFIO to avoid mapping the memory in again after live update
> because it already exists.
> https://github.com/jgowans/qemu/commit/6e4f17f703eaf2a6f1e4cb2576d61683eaee02b0
> (the above flag should only be set *after* live update...).

Definitely no to that entire idea. It completely breaks how the memory
lifetime model works in iommufd.

iommufd has to re-establish its pins, and has to rebuild all its
mapping data structures. Otherwise it won't work correctly at all.

This is what I was saying in the other thread, you can't just ignore
fully restoring the iommu environment.

The end goal must be to have fully reconstituted iommufd with all its
maps, ioas's, and memory pins back to fully normal operation.

IMHO you need to focus on atomic replace where you go from the frozen
pkernfs environment to a live operating environment by hitlessly
replacing the IO page table in the HW. Ie going from an
IOMMU_DOMAIN_PKERNFS to an IOMMU_DOMAIN_PAGING owned by iommufd that
describes exactly the same translation.

"adopting" an entire io page table with unknown contents, and still
being able to correctly do map/unmap seems way too hard.

Jason