Message ID | 20240205120203.60312-1-jgowans@amazon.com (mailing list archive) |
---|---|
Series | Pkernfs: Support persistence for live update |
On Mon, 5 Feb 2024 12:01:45 +0000 James Gowans <jgowans@amazon.com> wrote:

> This RFC is to solicit feedback on the approach of implementing support for live
> update via an in-memory filesystem responsible for storing all live update state
> as files in the filesystem.
>
> Hypervisor live update is a mechanism to support updating a hypervisor via kexec
> in a way that has limited impact to running virtual machines. This is done by
> pausing/serialising running VMs, kexec-ing into a new kernel, starting new VMM
> processes and then deserialising/resuming the VMs so that they continue running
> from where they were. Virtual machines can have PCI devices passed through and
> in order to support live update it’s necessary to persist the IOMMU page tables
> so that the devices can continue to do DMA to guest RAM during kexec.
>
> This RFC is a follow-on from a discussion held during LPC 2023 KVM MC
> which explored ways in which the live update problem could be tackled;
> this was one of them:
> https://lpc.events/event/17/contributions/1485/
>
> The approach sketched out in this RFC introduces a new in-memory filesystem,
> pkernfs. Pkernfs takes ownership, separately from the Linux memory
> management system, of RAM which is carved out from the normal MM allocator
> and donated to pkernfs. Files are created in pkernfs for a few purposes;
> the following things need to be preserved and re-hydrated after
> kexec to support live update:
>
> * Guest memory: to be able to restore the VM its memory must be
> preserved. This is achieved by using a regular file in pkernfs for guest RAM.
> As this guest RAM is not part of the normal Linux core mm allocator and
> has no struct pages, it can be removed from the direct map, which
> improves security posture for guest RAM. Similar to memfd_secret.
>
> * IOMMU root page tables: for the IOMMU to have any ability to do DMA
> during kexec it needs root page tables to look up per-domain page
> tables. IOMMU root page tables are stored in a special path in pkernfs:
> iommu/root-pgtables. The Intel IOMMU driver is modified to hook into
> pkernfs to get the chunk of memory that it can use to allocate root
> pgtables.
>
> * IOMMU domain page tables: in order for VM-initiated DMA operations to
> continue running while kexec is happening the IOVA to PA address
> translations for persisted devices need to continue to work. Similar to
> root pgtables the per-domain page tables for persisted devices are
> allocated from a pkernfs file so that they are also persisted across
> kexec. This is done by using pkernfs files for IOMMU domain page
> tables. Not all devices are persistent, so VFIO is updated to support
> defining persistent page tables on passed through devices.
>
> * Updates to IOMMU and PCI are needed to make device handover across
> kexec work properly. Although not fully complete, some of the changes
> needed around avoiding device re-setting and re-probing are sketched
> in this RFC.
>
> Guest RAM and IOMMU state are just the first two things needed for live update.
> Pkernfs opens the door for other kernel state which can improve kexec or add
> more capabilities to live update to also be persisted as new files.
>
> The main aspect we’re looking for feedback/opinions on here is the concept of
> putting all persistent state in a single filesystem: combining guest RAM and
> IOMMU pgtables in one store. Also, the question of a hard separation between
> persistent memory and ephemeral memory, compared to allowing arbitrary pages to
> be persisted. Pkernfs does it via a hard separation defined at boot time, other
> approaches could make the carving out of persistent pages dynamic.
>
> Sign-offs are intentionally omitted to make it clear that this is a
> concept sketch RFC and not intended for merging.
>
> On CC are folks who have sent RFCs around this problem space before, as
> well as filesystem, kexec, IOMMU, MM and KVM lists and maintainers.
>
> == Alternatives ==
>
> There have been other RFCs which cover some aspect of the live update problem
> space. So far, all public approaches with KVM neglected device assignment which
> introduces a new dimension of problems. Prior art in this space includes:
>
> 1) Kexec Hand Over (KHO) [0]: This is a generic mechanism to pass kernel state
> across kexec. It also supports specifying persisted memory pages which could be
> used to carve out IOMMU pgtable pages from the new kernel’s buddy allocator.
>
> 2) PKRAM [1]: Tmpfs-style filesystem which dynamically allocates memory which can
> be used for guest RAM and is preserved across kexec by passing a pointer to the
> root page.
>
> 3) DMEMFS [2]: Similar to pkernfs, DMEMFS is a filesystem on top of a reserved
> chunk of memory specified via kernel cmdline parameter. It is not persistent but
> aims to remove the need for struct page overhead.
>
> 4) Kernel memory pools [3, 4]: These provide a mechanism for kernel modules/drivers
> to allocate persistent memory, and restore that memory after kexec. They do
> not attempt to provide the ability to store userspace accessible state or have a
> filesystem interface.
>
> == How to use ==
>
> Use the memmap and pkernfs cmd line args to carve memory out of system RAM and
> donate it to pkernfs. For example to carve out 1 GiB of RAM starting at physical
> offset 1 GiB:
>   memmap=1G%1G nopat pkernfs=1G!1G
>
> Mount pkernfs somewhere, for example:
>   mount -t pkernfs /mnt/pkernfs
>
> Allocate a file for guest RAM:
>   touch /mnt/pkernfs/guest-ram
>   truncate -s 100M /mnt/pkernfs/guest-ram
>
> Add QEMU cmdline option to use this as guest RAM:
>   -object memory-backend-file,id=pc.ram,size=100M,mem-path=/mnt/pkernfs/guest-ram,share=yes
>   -M q35,memory-backend=pc.ram
>
> Allocate a file for IOMMU domain page tables:
>   touch /mnt/pkernfs/iommu/dom-0
>   truncate -s 2M /mnt/pkernfs/iommu/dom-0
>
> That file must be supplied to VFIO when creating the IOMMU container, via the
> VFIO_CONTAINER_SET_PERSISTENT_PGTABLES ioctl. Example: [5]
>
> After kexec, re-mount pkernfs and re-use those files for guest RAM and IOMMU
> state. When doing DMA mapping specify the additional flag
> VFIO_DMA_MAP_FLAG_LIVE_UPDATE to indicate that IOVAs are set up already.
> Example: [6].
>
> == Limitations ==
>
> This is an RFC design to sketch out the concept so that there can be a discussion
> about the general approach. There are many gaps and hacks; the idea is to keep
> this RFC as simple as possible. Limitations include:
>
> * Needing to supply the physical memory range for pkernfs as a kernel cmdline
> parameter. Better would be to allocate memory for pkernfs dynamically on first
> boot and pass that across kexec. Doing so would require additional integration
> with memblocks and some ability to pass the dynamically allocated ranges
> across. KHO [0] could support this.
>
> * A single filesystem with no support for NUMA awareness. Better would be to
> support multiple named pkernfs mounts which can cover different NUMA nodes.
>
> * Skeletal filesystem code. There’s just enough functionality to make it usable to
> demonstrate the concept of using files for guest RAM and IOMMU state.
>
> * Use-after-frees for IOMMU mappings. Currently nothing stops the pkernfs guest
> RAM files being deleted or resized while IOMMU mappings are set up, which would
> allow DMA to freed memory. Better integration with guest RAM files and
> IOMMU/VFIO is necessary.
>
> * Needing to drive and re-hydrate the IOMMU page tables by defining an IOMMU file.
> Really we should move the abstraction one level up and make the whole VFIO
> container persistent via a pkernfs file. That way you’d "just" re-open the VFIO
> container file and all of the DMA mappings inside VFIO would already be set up.

Note that the vfio container is on a path towards deprecation; this
should be refocused on vfio relative to iommufd. There would need to
be a strong argument for a container/type1 extension to support this;
iommufd would need to be the first-class implementation.

Thanks,
Alex

> * Inefficient use of filesystem space. Every mappings block is 2 MiB which is both
> wasteful and a hard upper limit on file size.
>
> [0] https://lore.kernel.org/kexec/20231213000452.88295-1-graf@amazon.com/
> [1] https://lore.kernel.org/kexec/1682554137-13938-1-git-send-email-anthony.yznaga@oracle.com/
> [2] https://lkml.org/lkml/2020/12/7/342
> [3] https://lore.kernel.org/all/169645773092.11424.7258549771090599226.stgit@skinsburskii./
> [4] https://lore.kernel.org/all/2023082506-enchanted-tripping-d1d5@gregkh/#t
> [5] https://github.com/jgowans/qemu/commit/e84cfb8186d71f797ef1f72d57d873222a9b479e
> [6] https://github.com/jgowans/qemu/commit/6e4f17f703eaf2a6f1e4cb2576d61683eaee02b0
>
>
> James Gowans (18):
>   pkernfs: Introduce filesystem skeleton
>   pkernfs: Add persistent inodes hooked into directies
>   pkernfs: Define an allocator for persistent pages
>   pkernfs: support file truncation
>   pkernfs: add file mmap callback
>   init: Add liveupdate cmdline param
>   pkernfs: Add file type for IOMMU root pgtables
>   iommu: Add allocator for pgtables from persistent region
>   intel-iommu: Use pkernfs for root/context pgtable pages
>   iommu/intel: zap context table entries on kexec
>   dma-iommu: Always enable deferred attaches for liveupdate
>   pkernfs: Add IOMMU domain pgtables file
>   vfio: add ioctl to define persistent pgtables on container
>   intel-iommu: Allocate domain pgtable pages from pkernfs
>   pkernfs: register device memory for IOMMU domain pgtables
>   vfio: support not mapping IOMMU pgtables on live-update
>   pci: Don't clear bus master is persistence enabled
>   vfio-pci: Assume device working after liveupdate
>
>  drivers/iommu/Makefile           |   1 +
>  drivers/iommu/dma-iommu.c        |   2 +-
>  drivers/iommu/intel/dmar.c       |   1 +
>  drivers/iommu/intel/iommu.c      |  93 +++++++++++++---
>  drivers/iommu/intel/iommu.h      |   5 +
>  drivers/iommu/iommu.c            |  22 ++--
>  drivers/iommu/pgtable_alloc.c    |  43 +++++++
>  drivers/iommu/pgtable_alloc.h    |  10 ++
>  drivers/pci/pci-driver.c         |   4 +-
>  drivers/vfio/container.c         |  27 +++++
>  drivers/vfio/pci/vfio_pci_core.c |  20 ++--
>  drivers/vfio/vfio.h              |   2 +
>  drivers/vfio/vfio_iommu_type1.c  |  51 ++++++---
>  fs/Kconfig                       |   1 +
>  fs/Makefile                      |   3 +
>  fs/pkernfs/Kconfig               |   9 ++
>  fs/pkernfs/Makefile              |   6 +
>  fs/pkernfs/allocator.c           |  51 +++++++++
>  fs/pkernfs/dir.c                 |  43 +++++++
>  fs/pkernfs/file.c                |  93 ++++++++++++++++
>  fs/pkernfs/inode.c               | 185 +++++++++++++++++++++++++++++++
>  fs/pkernfs/iommu.c               | 163 +++++++++++++++++++++++++++
>  fs/pkernfs/pkernfs.c             | 115 +++++++++++++++++++
>  fs/pkernfs/pkernfs.h             |  61 ++++++++++
>  include/linux/init.h             |   1 +
>  include/linux/iommu.h            |   6 +-
>  include/linux/pkernfs.h          |  38 +++++++
>  include/uapi/linux/vfio.h        |  10 ++
>  init/main.c                      |  10 ++
>  29 files changed, 1029 insertions(+), 47 deletions(-)
>  create mode 100644 drivers/iommu/pgtable_alloc.c
>  create mode 100644 drivers/iommu/pgtable_alloc.h
>  create mode 100644 fs/pkernfs/Kconfig
>  create mode 100644 fs/pkernfs/Makefile
>  create mode 100644 fs/pkernfs/allocator.c
>  create mode 100644 fs/pkernfs/dir.c
>  create mode 100644 fs/pkernfs/file.c
>  create mode 100644 fs/pkernfs/inode.c
>  create mode 100644 fs/pkernfs/iommu.c
>  create mode 100644 fs/pkernfs/pkernfs.c
>  create mode 100644 fs/pkernfs/pkernfs.h
>  create mode 100644 include/linux/pkernfs.h
>
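For readers who want to see what the guest RAM steps in the cover letter amount to from a VMM's point of view, here is a minimal userspace sketch. It only mirrors the `touch` / `truncate -s` / `share=yes` sequence with plain POSIX calls; the helper names `map_guest_ram()` and `unmap_guest_ram()` are made up for illustration and are not part of the RFC or of QEMU. Nothing in it is pkernfs-specific, so it can be tried against any filesystem path.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: create (if needed) and size a backing file, then
 * map it shared. On a pkernfs setup `path` would be something like
 * /mnt/pkernfs/guest-ram. Returns the mapping, or NULL on failure.
 */
static void *map_guest_ram(const char *path, size_t size)
{
    int fd = open(path, O_RDWR | O_CREAT, 0600);   /* like `touch` */
    if (fd < 0)
        return NULL;

    if (ftruncate(fd, (off_t)size) != 0) {         /* like `truncate -s` */
        close(fd);
        return NULL;
    }

    /* MAP_SHARED so that stores land in the file, as with share=yes. */
    void *ram = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);                                     /* the mapping stays valid */
    return ram == MAP_FAILED ? NULL : ram;
}

/* Hypothetical helper: drop the mapping; the file contents remain. */
static int unmap_guest_ram(void *ram, size_t size)
{
    assert(ram != NULL);
    return munmap(ram, size);
}
```

The point of the shared file mapping is that a new VMM process can call the same helper on the same path and see the bytes the previous process wrote. Whether the file also survives kexec depends entirely on the filesystem behind `path`, which is the part pkernfs is meant to provide.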
On Mon, Feb 05, 2024 at 12:01:45PM +0000, James Gowans wrote:

> The main aspect we’re looking for feedback/opinions on here is the concept of
> putting all persistent state in a single filesystem: combining guest RAM and
> IOMMU pgtables in one store. Also, the question of a hard separation between
> persistent memory and ephemeral memory, compared to allowing arbitrary pages to
> be persisted. Pkernfs does it via a hard separation defined at boot time, other
> approaches could make the carving out of persistent pages dynamic.

I think if you are going to attempt something like this then the end
result must bring things back to having the same data structures fully
restored.

It is fine that the pkernfs holds some persistent memory that
guarantees the IOMMU can remain programmed and the VM pages can become
fixed across the kexec.

But once the VMM starts to restore itself we need to get back to the
original configuration:
 - A mmap that points to the VM's physical pages
 - An iommufd IOAS that points to the above mmap
 - An iommufd HWPT that represents that same mapping
 - An iommu_domain programmed into HW that the HWPT

Ie you can't just reboot and leave the IOMMU hanging out in some
undefined land - especially in latest kernels!

For vt-d you need to retain the entire root table and all the required
context entries too. The restarting iommu needs to understand that it
has to "restore" a temporary iommu_domain from the pkernfs.
You can later reconstitute a proper iommu_domain from the VMM and
atomically switch.

So, I'm surprised to see this approach where things just live forever
in the pkernfs, I don't see how "restore" is going to work very well
like this.

I would think that a save/restore mentality would make more
sense. For instance you could make a special iommu_domain that is fixed
and lives in the pkernfs. The operation would be to copy from the live
iommu_domain to the fixed one and then switch the iommu HW over to the
fixed one.

In the post-kexec world the iommu would recreate that special domain
and point the iommu at it. (copying the root and context descriptions
out of the pkernfs). Then somehow that would get into iommufd and VFIO
so that it could take over that special mapping during its startup.

Then you'd build the normal operating ioas and hwpt (with all the
right page refcounts/etc) then switch to it and free the pkernfs
memory.

It seems a lot less invasive to me. The special case is clearly a
special case and doesn't mess up the normal operation of the drivers.

It becomes more like kdump where the iommu driver is running in a
fairly normal mode, just with some stuff copied from the prior kernel.

Your text spent a lot of time talking about the design of how the pages
persist, which is interesting, but it seems like only a small part of
the problem. Actually using that mechanism in a sane way and covering
all the functional issues in the HW drivers is going to be really
challenging.

> * Needing to drive and re-hydrate the IOMMU page tables by defining an IOMMU file.
> Really we should move the abstraction one level up and make the whole VFIO
> container persistent via a pkernfs file. That way you’d "just" re-open the VFIO
> container file and all of the DMA mappings inside VFIO would already be set up.

I doubt this.. It probably needs to be much finer grained actually,
otherwise you are going to be serializing everything. Somehow I think
you are better to serialize a minimum and try to reconstruct
everything else in userspace. Like conserving iommufd IDs would be a
huge PITA.

There are also going to be lots of security questions here, like we
can't just let userspace feed in any garbage and violate vfio and
iommu invariants.

Jason
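To make the save/restore idea above a little more concrete, here is a deliberately tiny toy model in plain C. It is not kernel code and does not reflect any real IOMMU driver: `struct toy_pgtable`, `struct toy_iommu` and the `toy_*()` functions are invented for illustration, and a flat array stands in for real page tables. The equality check in `toy_restore()` is likewise an assumption of this sketch, one possible way to refuse a rebuilt configuration that does not match what the hardware was left translating.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TOY_NR_ENTRIES 8

/* Toy "page table": index is an IOVA page number, value a physical page. */
struct toy_pgtable {
    uint64_t pa[TOY_NR_ENTRIES];
};

/* The hardware is always in one explicit, named state. */
enum toy_domain_kind { TOY_DOMAIN_NORMAL, TOY_DOMAIN_PERSISTED };

struct toy_iommu {
    enum toy_domain_kind kind;
    const struct toy_pgtable *active;   /* table the "HW" walks */
};

/* Before kexec: copy the live table into persistent memory, switch to it. */
static void toy_save(struct toy_iommu *hw, const struct toy_pgtable *live,
                     struct toy_pgtable *persistent)
{
    memcpy(persistent, live, sizeof(*persistent));
    hw->active = persistent;
    hw->kind = TOY_DOMAIN_PERSISTED;
}

/* After kexec, early: deliberately re-enter the special persisted state. */
static void toy_adopt(struct toy_iommu *hw,
                      const struct toy_pgtable *persistent)
{
    hw->active = persistent;
    hw->kind = TOY_DOMAIN_PERSISTED;
}

/*
 * After kexec, later: once a normal table has been rebuilt, switch to it,
 * but only if it translates identically. Returns 0 on success, -1 on
 * mismatch (in which case the persisted state stays active).
 */
static int toy_restore(struct toy_iommu *hw, const struct toy_pgtable *rebuilt)
{
    assert(hw->kind == TOY_DOMAIN_PERSISTED);
    if (memcmp(hw->active, rebuilt, sizeof(*rebuilt)) != 0)
        return -1;
    hw->active = rebuilt;
    hw->kind = TOY_DOMAIN_NORMAL;
    return 0;
}
```

In this model the persistent region is only used between `toy_save()` and a successful `toy_restore()`; normal operation never touches it, which is the property the email argues for.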
> Also, the question of a hard separation between
> persistent memory and ephemeral memory, compared to allowing
> arbitrary pages to be persisted. Pkernfs does it via a hard
> separation defined at boot time, other approaches could make the
> carving out of persistent pages dynamic.

Speaking from experience here - in Azure (Boost) we have been using
hard-carved out memory areas (DAX devices with ranges configured via
DTB) for persisting state across kexec for ~5 years or so. In a
nutshell: don't, it's a mistake.

It's a constant and consistent source of problems, headaches, issues
and workarounds piled upon workarounds, held together with duct tape
and prayers. It's just not flexible enough for any modern system. For
example, unless _all_ the machines are ridiculously overprovisioned in
terms of memory capacity (and guaranteed to remain so, forever), you
end up wasting enormous amounts of memory.

In Azure we are very much interested in a nice, well-abstracted,
first-class replacement for that setup that allows persisting data
across kexec, and in systemd userspace we'd very much want to use it as
well, but it really, really needs to be dynamic, and avoid the pitfall
of a hard-configured carved-out chunk.
Hi Jason,

Thanks for this great feedback on the approach - it's exactly the sort
of thing we were looking for.

On Mon, 2024-02-05 at 13:42 -0400, Jason Gunthorpe wrote:
>
> On Mon, Feb 05, 2024 at 12:01:45PM +0000, James Gowans wrote:
>
> > The main aspect we’re looking for feedback/opinions on here is the concept of
> > putting all persistent state in a single filesystem: combining guest RAM and
> > IOMMU pgtables in one store. Also, the question of a hard separation between
> > persistent memory and ephemeral memory, compared to allowing arbitrary pages to
> > be persisted. Pkernfs does it via a hard separation defined at boot time, other
> > approaches could make the carving out of persistent pages dynamic.
>
> I think if you are going to attempt something like this then the end
> result must bring things back to having the same data structures fully
> restored.
>
> It is fine that the pkernfs holds some persistent memory that
> guarantees the IOMMU can remain programmed and the VM pages can become
> fixed across the kexec.
>
> But once the VMM starts to restore itself we need to get back to the
> original configuration:
>  - A mmap that points to the VM's physical pages
>  - An iommufd IOAS that points to the above mmap
>  - An iommufd HWPT that represents that same mapping
>  - An iommu_domain programmed into HW that the HWPT

(A quick note on iommufd vs VFIO: I'll still keep referring to VFIO for
now because that's what I know, but will explore iommufd more and reply
in more detail about iommufd in the other email thread.)

How much of this do you think should be done automatically, vs how much
should userspace need to drive? With this RFC userspace basically
re-drives everything, including re-injecting the file containing the
persistent page tables into the IOMMU domain via VFIO.

Part of the reason is simplicity, to avoid having auto-deserialise code
paths in the drivers and modules. Another part of the reason is so that
userspace can get FD handles on the resources. Typically FDs are
returned by doing actions like creating VFIO containers. If we make all
that automatic then there needs to be some other mechanism for
auto-restored resources to present themselves to userspace so that
userspace can discover and pick them up again.

One possible way to do this would be to populate a bunch of files in
procfs for each persisted IOMMU domain that allows userspace to discover
and pick it up.

Can you expand on what you mean by "A mmap that points to the VM's
physical pages"? Are you suggesting that the QEMU process automatically
gets something appearing in its address space? Part of the live update
process involves potentially changing the userspace binaries: doing
kexec and booting a new system is an opportunity to boot new versions of
the userspace binary. So we shouldn't try to preserve too much of
userspace state; it's better to let it re-create internal data
structures and do fresh mmaps.

What I'm really asking is: do you have a specific suggestion about how
these persistent resources should present themselves to userspace and
how userspace can discover them and pick them up?

>
> Ie you can't just reboot and leave the IOMMU hanging out in some
> undefined land - especially in latest kernels!

Not too sure what you mean by "undefined land" - the idea is that the
IOMMU keeps doing what it was doing until userspace comes along and
re-creates the handles to the IOMMU, at which point it can do
modifications like change mappings or tear the domain down. This is what
deferred attach gives us, I believe, and why I had to change it to be
enabled. Just leave the IOMMU domain alone until userspace re-creates it
with the original tables.
Perhaps I'm missing your point. :-)

>
> For vt-d you need to retain the entire root table and all the required
> context entries too. The restarting iommu needs to understand that it
> has to "restore" a temporary iommu_domain from the pkernfs.
> You can later reconstitute a proper iommu_domain from the VMM and
> atomically switch.

Why does it need to go via a temporary domain? The current approach is
just to leave the IOMMU domain running as-is via deferred attach, and
later when userspace starts up it will create the iommu_domain backed by
the same persistent page tables.

>
> So, I'm surprised to see this approach where things just live forever
> in the pkernfs, I don't see how "restore" is going to work very well
> like this.

Can you expand on why the suggested restore path will be problematic? In
summary the idea is to re-create all of the "ephemeral" data structures
by re-doing ioctls like MAP_DMA, but keeping the persistent IOMMU
root/context tables pointed at the original persistent page tables. The
ephemeral data structures are re-created in userspace but the persistent
page tables left alone. This is of course dependent on userspace
re-creating things *correctly* - it can easily do the wrong thing.
Perhaps this is the issue? Or is there a problem even if userspace is
sane?

> I would think that a save/restore mentality would make more
> sense. For instance you could make a special iommu_domain that is fixed
> and lives in the pkernfs. The operation would be to copy from the live
> iommu_domain to the fixed one and then switch the iommu HW over to the
> fixed one.
>
> In the post-kexec world the iommu would recreate that special domain
> and point the iommu at it. (copying the root and context descriptions
> out of the pkernfs). Then somehow that would get into iommufd and VFIO
> so that it could take over that special mapping during its startup.

The save and restore model is super interesting - I'm keen to discuss
this as an alternative. You're suggesting that the IOMMU driver have a
serialise phase just before kexec where it dumps everything into
persistent memory and then after kexec pulls it back into ephemeral
memory. That's probably do-able, but it may increase the critical
section latency of live update (every millisecond counts!) and I'm also
not too sure what that buys compared to always working with persistent
memory and just always being in a state where persistent data is always
being used and can be picked up as-is.

However, the idea of a serialise and deserialise operation is relevant
to a possible alternative to this RFC. My colleague Alex Graf is working
on a framework called Kexec Hand Over (KHO):
https://lore.kernel.org/all/20240117144704.602-1-graf@amazon.com/#r
That allows drivers/modules to mark arbitrary memory pages as persistent
(ie: not allocatable by next kernel) and to pass over some serialised
state across kexec.
An alternative to IOMMU domain persistence could be to use KHO to mark
the IOMMU root, context and domain page table pages as persistent.

>
> Then you'd build the normal operating ioas and hwpt (with all the
> right page refcounts/etc) then switch to it and free the pkernfs
> memory.
>
> It seems a lot less invasive to me. The special case is clearly a
> special case and doesn't mess up the normal operation of the drivers.

Yes, a serialise/deserialise approach does have this distinct advantage
of not needing to change the alloc/free code paths. Pkernfs requires a
shim in the allocator to use persistent memory.

>
> > * Needing to drive and re-hydrate the IOMMU page tables by defining an IOMMU file.
> > Really we should move the abstraction one level up and make the whole VFIO
> > container persistent via a pkernfs file. That way you’d "just" re-open the VFIO
> > container file and all of the DMA mappings inside VFIO would already be set up.
>
> I doubt this.. It probably needs to be much finer grained actually,
> otherwise you are going to be serializing everything. Somehow I think
> you are better to serialize a minimum and try to reconstruct
> everything else in userspace. Like conserving iommufd IDs would be a
> huge PITA.
>
> There are also going to be lots of security questions here, like we
> can't just let userspace feed in any garbage and violate vfio and
> iommu invariants.

Right! This is definitely one of the big gaps at the moment: this
approach requires that VFIO has the same state re-driven into it from
userspace so that the persistent and ephemeral data match. If userspace
does something dodgy, well, it may cause problems. :-)
That's exactly why I thought we should move the abstraction up to a
level that doesn't depend on userspace re-driving data. It sounds like
you were suggesting something similar in the first part of your comment,
but I didn't fully understand how you'd like to see it presented to
userspace.

JG
On Mon, 2024-02-05 at 10:10 -0700, Alex Williamson wrote:
> > * Needing to drive and re-hydrate the IOMMU page tables by defining
> > an IOMMU file.
> > Really we should move the abstraction one level up and make the
> > whole VFIO container persistent via a pkernfs file. That way you’d
> > "just" re-open the VFIO container file and all of the DMA mappings
> > inside VFIO would already be set up.
>
> Note that the vfio container is on a path towards deprecation, this
> should be refocused on vfio relative to iommufd. There would need to
> be a strong argument for a container/type1 extension to support this,
> iommufd would need to be the first class implementation. Thanks,

Ack! When I first started putting pkernfs together, iommufd wasn't
integrated into QEMU yet, hence I stuck with VFIO for this PoC.
I'm thrilled to see that iommufd now seems to be integrated in QEMU!
Good opportunity to get to grips with it.

The VFIO-specific part of this patch is essentially ioctls on the
*container* to be able to:

1. Define persistent page tables (PPTs) on the containers so that those
PPTs are used by the IOMMU domain and hence by all devices added to that
container.
https://github.com/jgowans/qemu/commit/e84cfb8186d71f797ef1f72d57d873222a9b479e

2. Tell VFIO to avoid mapping the memory in again after live update
because it already exists.
https://github.com/jgowans/qemu/commit/6e4f17f703eaf2a6f1e4cb2576d61683eaee02b0
(the above flag should only be set *after* live update...).

Do you have a rough suggestion about how something similar could be
done with iommufd?

JG
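As a rough illustration of item 2, the sketch below fills in the existing type1 `struct vfio_iommu_type1_dma_map` from `<linux/vfio.h>` and optionally sets the live-update flag named in the cover letter. Two things here are assumptions rather than facts from the thread: the helper name `fill_dma_map()` is invented, and the numeric value given to `VFIO_DMA_MAP_FLAG_LIVE_UPDATE` is a placeholder, since the RFC's actual value is not quoted in this discussion and the flag does not exist in mainline headers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>
#include <linux/vfio.h>

/*
 * PLACEHOLDER value: the RFC adds this flag to include/uapi/linux/vfio.h,
 * but its real bit position is not given in this thread. Bit 31 is picked
 * here only so it cannot collide with the existing READ/WRITE/VADDR flags.
 */
#ifndef VFIO_DMA_MAP_FLAG_LIVE_UPDATE
#define VFIO_DMA_MAP_FLAG_LIVE_UPDATE (1u << 31)
#endif

/* Hypothetical helper: prepare the argument for VFIO_IOMMU_MAP_DMA. */
static void fill_dma_map(struct vfio_iommu_type1_dma_map *map,
                         void *vaddr, uint64_t iova, uint64_t size,
                         bool after_live_update)
{
    assert(map != NULL);
    memset(map, 0, sizeof(*map));
    map->argsz = sizeof(*map);
    map->flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;
    /* Only set once the IOVAs already exist in the persisted tables. */
    if (after_live_update)
        map->flags |= VFIO_DMA_MAP_FLAG_LIVE_UPDATE;
    map->vaddr = (uint64_t)(uintptr_t)vaddr;
    map->iova = iova;
    map->size = size;
}
```

A VMM would then pass the filled structure to `ioctl(container_fd, VFIO_IOMMU_MAP_DMA, &map)`. On a first boot the flag stays clear so the mapping is created normally; after live update the same call with the flag set would, per the description above, only rebuild VFIO's bookkeeping. As noted earlier in the thread, this is the container/type1 interface, and an iommufd equivalent would look different.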
On Wed, Feb 07, 2024 at 02:45:42PM +0000, Gowans, James wrote: > Hi Jason, > > Thanks for this great feedback on the approach - it's exactly the sort > of thing we were looking for. > > On Mon, 2024-02-05 at 13:42 -0400, Jason Gunthorpe wrote: > > > > On Mon, Feb 05, 2024 at 12:01:45PM +0000, James Gowans wrote: > > > > > The main aspect we’re looking for feedback/opinions on here is the concept of > > > putting all persistent state in a single filesystem: combining guest RAM and > > > IOMMU pgtables in one store. Also, the question of a hard separation between > > > persistent memory and ephemeral memory, compared to allowing arbitrary pages to > > > be persisted. Pkernfs does it via a hard separation defined at boot time, other > > > approaches could make the carving out of persistent pages dynamic. > > > > I think if you are going to attempt something like this then the end > > result must bring things back to having the same data structures fully > > restored. > > > > It is fine that the pkernfs holds some persistant memory that > > guarentees the IOMMU can remain programmed and the VM pages can become > > fixed across the kexec > > > > But once the VMM starts to restore it self we need to get back to the > > original configuration: > > - A mmap that points to the VM's physical pages > > - An iommufd IOAS that points to the above mmap > > - An iommufd HWPT that represents that same mapping > > - An iommu_domain programmed into HW that the HWPT > > (A quick note on iommufd vs VFIO: I'll still keep referring to VFIO for > now because that's what I know, but will explore iommufd more and reply > in more detail about iommufd in the other email thread.) > > How much of this do you think should be done automatically, vs how much > should userspace need to drive? With this RFC userspace basically re- > drives everything, including re-injecting the file containing the > persistent page tables into the IOMMU domain via VFIO. 
My guess is that fully automatically is hard/impossible as there is lots and lots of related state that has to come back. Like how do you get all the internal iommufs IOAS related datastructures automatically. Seems way too hard. Feels simpler to have userspace redo whatever setup was needed to get back to the right spot. > Part of the reason is simplicity, to avoid having auto-deserialise code > paths in the drivers and modules. Another part of the reason so that > userspace can get FD handles on the resources. Typically FDs are > returned by doing actions like creating VFIO containers. If we make all > that automatic then there needs to be some other mechanism for auto- > restored resources to present themselves to userspace so that userspace > can discover and pick them up again. Right, there is lots of state all over the place that would hard to just re-materialize. > Can you expand on what you mean by "A mmap that points to the VM's > physical pages?" Are you suggesting that the QEMU process automatically > gets something appearing in it's address space? Part of the live update > process involves potentially changing the userspace binaries: doing > kexec and booting a new system is an opportunity to boot new versions of > the userspace binary. So we shouldn't try to preserve too much of > userspace state; it's better to let it re-create internal data > structures do fresh mmaps. I expect the basic flow would be like: Starting kernel - run the VMM - Allocate VM memory in the pkernfs - mmap that VM memory - Attach the VM memory to KVM - Attach the VM memory mmap to IOMMUFD - Operate the VM Suspending the kernel - Stop touching iommufd - Freeze changes to the IOMMU, and move its working memory to pkernfs - Exit the kernel New kernel - Recover the frozen IOMMU back to partially running, like crash dump. Continue to use some of the working memory in the pkernfs - run the new VMM. 
Some IOMMU_DOMAIN_PKERNFS thing to represent this state - mmap the VM memory - Get KVM going again - Attach the new VMM's VM memory mmap to IOMMUFD - Replace the iommu partial configuration with a full configuration - Free the pkernfs iommu related memory > What I'm really asking is: do you have a specific suggestion about how > these persistent resources should present themselves to userspace and > how userspace can discover them and pick them up? The only tricky bit in the above is having VFIO know it should leave the iommu and PCI device state alone when the VFIO cdev is first opened. Otherwise everything else is straightforward. Presumably vfio would know it inherited a pkernfs blob and would do the right stuff. May be some uAPI fussing there to handshake that properly Once VFIO knows this it can operate iommufd to conserve the IOMMU_DOMAIN_PKERNFS as well. > > Ie you can't just reboot and leave the IOMMU hanging out in some > > undefined land - especially in latest kernels! > > Not too sure what you mean by "undefined land" - the idea is that the > IOMMU keeps doing what it was going until userspace comes along re- In terms of how the iommu subystems understands what the iommu is doing. The iommu subsystem now forces the iommu into defined states as part of its startup and you need an explicit defined state which means "continuing to use the pkernfs saved state" which the iommu driver deliberately enters. > creates the handles to the IOMMU at which point it can do modifications > like change mappings or tear the domain down. This is what deferred > attached gives us, I believe, and why I had to change it to be > enabled. VFIO doesn't trigger deferred attach at all, that patch made no sense. > > For vt-d you need to retain the entire root table and all the required > > context entries too, The restarting iommu needs to understand that it > > has to "restore" a temporary iommu_domain from the pkernfs. 
> > You can later reconstitute a proper iommu_domain from the VMM and
> > atomic switch.
>
> Why does it need to go via a temporary domain?

Because that is the software model we have now. You must be explicit,
not in some la-la undefined land of "I don't know WTF is going on but
if I squint this is doing some special thing!" That concept is dead in
the iommu subsystem, you must be explicit.

If the iommu is translating through special page tables stored in a
pkernfs then you need an IOMMU_DOMAIN_PKERNFS to represent that
behavior.

> > So, I'm surprised to see this approach where things just live forever
> > in the kernfs, I don't see how "restore" is going to work very well
> > like this.
>
> Can you expand on why the suggested restore path will be problematic? In
> summary the idea is to re-create all of the "ephemeral" data structures
> by re-doing ioctls like MAP_DMA, but keeping the persistent IOMMU
> root/context tables pointed at the original persistent page tables. The
> ephemeral data structures are re-created in userspace but the persistent
> page tables left alone. This is of course dependent on userspace
> re-creating things *correctly* - it can easily do the wrong thing. Perhaps
> this is the issue? Or is there a problem even if userspace is sane.

Because how do you regain control of the iommu in a fully configured
way with all the right pin counts and so forth? It seems impossible
like this, all that information is washed away during the kexec.

It seems easier if the pkernfs version of the iommu configuration is
temporary and very special. The normal working mode is just exactly as
today.

> > I would think that a save/restore mentalitity would make more
> > sense. For instance you could make a special iommu_domain that is fixed
> > and lives in the pkernfs. The operation would be to copy from the live
> > iommu_domain to the fixed one and then replace the iommu HW to the
> > fixed one.
> >
> > In the post-kexec world the iommu would recreate that special domain
> > and point the iommu at it. (copying the root and context descriptions
> > out of the pkernfs). Then somehow that would get into iommufd and VFIO
> > so that it could take over that special mapping during its startup.
>
> The save and restore model is super interesting - I'm keen to discuss
> this as an alternative. You're suggesting that IOMMU driver have a
> serialise phase just before kexec where it dumps everything into
> persistent memory and then after kexec pulls it back into ephemeral
> memory. That's probably do-able, but it may increase the critical
> section latency of live update (every millisecond counts!)

Suspend preparation can be done before stopping the vCPUs. You have
to commit to freezing the iommu which only means things like memory
hotplug can't progress. So it isn't critical path.

Same on resume, you can resume KVM and the vCPUs and leave the IOMMU
in its suspended state while you work on returning it to normal
operation. Again only memory hotplug becomes blocked so it isn't
critical path.

> and I'm also not too sure what that buys compared to always working
> with persistent memory and just always being in a state where
> persistent data is always being used and can be picked up as-is.

You don't mess up the entire driver and all of its memory management,
and end up with a problem where you can't actually properly restore it
anyhow :)

> However, the idea of a serialise and deserialise operation is relevant
> to a possible alternative to this RFC. My colleague Alex Graf is working
> on a framework called Kexec Hand Over (KHO):
> https://lore.kernel.org/all/20240117144704.602-1-graf@amazon.com/#r
> That allows drivers/modules to mark arbitrary memory pages as persistent
> (ie: not allocatable by next kernel) and to pass over some serialised
> state across kexec.
> An alternative to IOMMU domain persistence could be to use KHO to mark
> the IOMMU root, context and domain page table pages as persistent via
> KHO.

IMHO it doesn't matter how you get the memory across the kexec, you
still have to answer all these questions about how does the new kernel
actually keep working with this inherited data, and how does it
transform the inherited data into operating data that is properly
situated in the kernel data structures.

You can't just start up iommufd and point it at a set of io page
tables that something else populated. It is fundamentally wrong and
would lead to corrupting the mm's pin counts.

> > > * Needing to drive and re-hydrate the IOMMU page tables by defining an IOMMU file.
> > > Really we should move the abstraction one level up and make the whole VFIO
> > > container persistent via a pkernfs file. That way you’d "just" re-open the VFIO
> > > container file and all of the DMA mappings inside VFIO would already be set up.
> >
> > I doubt this.. It probably needs to be much finer grained actually,
> > otherwise you are going to be serializing everything. Somehow I think
> > you are better to serialize a minimum and try to reconstruct
> > everything else in userspace. Like conserving iommufd IDs would be a
> > huge PITA.
> >
> > There are also going to be lots of security questions here, like we
> > can't just let userspace feed in any garbage and violate vfio and
> > iommu invariants.
>
> Right! This is definitely one of the big gaps at the moment: this
> approach requires that VFIO has the same state re-driven into it from
> userspace so that the persistent and ephemeral data match. If userspace
> does something dodgy, well, it may cause problems. :-)
> That's exactly why I thought we should move the abstraction up to a
> level that doesn't depend on userspace re-driving data.
> It sounds like you were suggesting similar in the first part of your
> comment, but I didn't fully understand how you'd like to see it
> presented to userspace.

I'd think you end up with some scenario where the pkernfs data has to
be trusted and sealed somehow before vfio would understand it. Ie you
have to feed it into vfio/etc via kexec only.

From a security perspective it does seem horribly wrong to expose such
sensitive data in a filesystem API where there are API surfaces that
would let userspace manipulate it.

At least from the iommu/vfio perspective:

The trusted data should originate inside a signed kernel only.

The signed kernel should prevent userspace from reading or writing it.

The next kernel should trust that the prior kernel put the correct
data in there. There should be no option for the next kernel userspace
to read or write the data.

The next kernel can automatically affiliate things with the trusted
inherited data that it knows was passed from the prior signed
kernel. eg autocreate an IOMMU_DOMAIN_PKERNFS, tweak VFIO, etc.

I understand the appeal of making a pkernfs to hold the VM's memory
pages, but it doesn't seem so secure for kernel internal data
structures.

Jason
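
To make the "explicit defined state" point in the message above concrete, here is a rough user-space sketch in Python. It is a toy model and not kernel code: the state names, the `ToyIommu` class, and the transition table are all hypothetical and exist only for illustration. It shows the idea that the IOMMU is always in one named state, that only the listed transitions are legal, and that map/unmap is refused while the IOMMU is frozen or still translating through inherited tables.

```python
from enum import Enum, auto


class IommuState(Enum):
    """Hypothetical states for a toy model; not real kernel identifiers."""
    LIVE = auto()              # normal operation, domain owned by iommufd
    FROZEN = auto()            # changes frozen, working memory moved to pkernfs
    PKERNFS_RESTORED = auto()  # new kernel translating via inherited tables


# Legal transitions, following the suspend / kexec / resume flow.
ALLOWED = {
    (IommuState.LIVE, IommuState.FROZEN),
    (IommuState.FROZEN, IommuState.PKERNFS_RESTORED),
    (IommuState.PKERNFS_RESTORED, IommuState.LIVE),
}


class ToyIommu:
    def __init__(self):
        self.state = IommuState.LIVE

    def transition(self, new_state):
        # Every state change must be one of the explicitly listed ones.
        if (self.state, new_state) not in ALLOWED:
            raise ValueError(
                f"illegal transition {self.state.name} -> {new_state.name}"
            )
        self.state = new_state

    def map_allowed(self):
        # Only a fully live domain accepts map/unmap. While frozen or
        # running from inherited tables, things like memory hotplug wait.
        return self.state is IommuState.LIVE
```

In this model there is no way to reach normal operation after a freeze without passing through the restored state, which is the property being asked for: no "undefined land" between the old kernel and the new one.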
On Wed, Feb 07, 2024 at 02:56:33PM +0000, Gowans, James wrote:

> 2. Tell VFIO to avoid mapping the memory in again after live update
> because it already exists.
> https://github.com/jgowans/qemu/commit/6e4f17f703eaf2a6f1e4cb2576d61683eaee02b0
> (the above flag should only be set *after* live update...).

Definitely no to that entire idea. It completely breaks how the memory
lifetime model works in iommufd.

iommufd has to re-establish its pins, and has to rebuild all its
mapping data structures. Otherwise it won't work correctly at all.

This is what I was saying in the other thread, you can't just ignore
fully restoring the iommu environment.

The end goal must be to have a fully reconstituted iommufd with all
its maps, IOASes, and memory pins back to fully normal operation.

IMHO you need to focus on atomic replace where you go from the frozen
pkernfs environment to a live operating environment by hitlessly
replacing the IO page table in the HW. Ie going from an
IOMMU_DOMAIN_PKERNFS to an IOMMU_DOMAIN_PAGING owned by iommufd that
describes exactly the same translation.

"adopting" an entire io page table with unknown contents, and still
being able to correctly do map/unmap seems way too hard.

Jason
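
The atomic replace described above can be sketched as a small user-space model in Python. This is an illustration under stated assumptions, not kernel code: `ToyDomain`, `atomic_replace`, the `hw` dictionary, and the flat dictionary standing in for an IO page table are all hypothetical. The sketch shows the two properties the message asks for: the new paging domain is built through the normal map path, so pins are re-established, and the swap is only accepted when the new domain describes exactly the same translation as the inherited one.

```python
class ToyDomain:
    """A toy IO page table: IOVA -> physical address (hypothetical model)."""

    def __init__(self, kind, translation=None):
        self.kind = kind                       # "PKERNFS" or "PAGING"
        self.translation = dict(translation or {})
        self.pins = {}                         # physical address -> pin count

    def map(self, iova, phys):
        # Mapping through the normal path takes a pin on the backing page.
        self.translation[iova] = phys
        self.pins[phys] = self.pins.get(phys, 0) + 1


def atomic_replace(hw, new_domain):
    """Swap the domain the toy hardware uses, only if DMA sees no change."""
    old = hw["domain"]
    if new_domain.translation != old.translation:
        raise ValueError("new domain does not describe the same translation")
    hw["domain"] = new_domain  # one assignment stands in for the HW swap
    return old


# Inherited, frozen translation. The new kernel holds no pins for it.
inherited = ToyDomain("PKERNFS", {0x1000: 0xA000, 0x2000: 0xB000})
hw = {"domain": inherited}

# The new VMM re-drives its mappings, which rebuilds pins and bookkeeping.
paging = ToyDomain("PAGING")
paging.map(0x1000, 0xA000)
paging.map(0x2000, 0xB000)

# Only now is the inherited domain retired and its pkernfs memory freeable.
retired = atomic_replace(hw, paging)
```

The contrast with "adopting" the inherited table is that here the map/unmap bookkeeping never has to be inferred from page table contents: it comes from the new VMM's own map calls, and the inherited table is used only as something to compare against.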