[00/10] Introduce guestmemfs: persistent in-memory filesystem

Message ID 20240805093245.889357-1-jgowans@amazon.com (mailing list archive)

Message

James Gowans Aug. 5, 2024, 9:32 a.m. UTC
In this patch series a new in-memory filesystem designed specifically
for live update is implemented. Live update is a mechanism to support
updating a hypervisor in a way that has limited impact to running
virtual machines. This is done by pausing/serialising running VMs,
kexec-ing into a new kernel, starting new VMM processes and then
deserialising/resuming the VMs so that they continue running from where
they were. To support this, guest memory needs to be preserved.

Guestmemfs implements preservation across kexec by carving out a large
contiguous block of host system RAM early in boot which is then used as
the data for the guestmemfs files. As well as preserving that large
block of data memory across kexec, the filesystem metadata is preserved
via the Kexec Hand Over (KHO) framework (still under review):
https://lore.kernel.org/all/20240117144704.602-1-graf@amazon.com/

Filesystem metadata is structured to make preservation across kexec
easy: inodes are one large contiguous array, and each inode has a
"mappings" block which defines which block from the filesystem data
memory corresponds to which offset in the file.
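
As a rough illustration of the layout described above, the sketch below
models a flat inode array plus a per-inode "mappings" block and shows
how a file offset could be translated into an offset in the data
region. All names, field choices and sizes (sketch_inode,
sketch_mappings, 512 entries per mappings block, the hole marker) are
assumptions made for illustration; they are not taken from the patches.

```c
#include <stdint.h>

/* Illustrative constants only; none of these come from the series. */
#define SKETCH_BLOCK_SIZE   (UINT64_C(2) << 20) /* one 2 MiB data block */
#define SKETCH_MAX_MAPPINGS 512                 /* entries per mappings block */
#define SKETCH_NO_BLOCK     UINT64_MAX          /* hole / lookup failure */

/*
 * Per-inode mappings block: entry i holds the index of the data block
 * backing file offsets [i * BLOCK_SIZE, (i + 1) * BLOCK_SIZE).
 */
struct sketch_mappings {
	uint64_t block[SKETCH_MAX_MAPPINGS];
};

/*
 * Fixed-size inode stored in one flat array. It refers to its mappings
 * block by index rather than by pointer, so the array stays meaningful
 * if it is mapped at a different address after kexec.
 */
struct sketch_inode {
	uint64_t size;         /* file size in bytes */
	uint64_t mappings_idx; /* index into the array of mappings blocks */
};

/*
 * Translate a file offset into a byte offset within the data region.
 * Returns SKETCH_NO_BLOCK if the offset is past EOF or falls in a hole.
 */
uint64_t sketch_file_to_data_offset(const struct sketch_inode *inode,
				    const struct sketch_mappings *maps,
				    uint64_t file_off)
{
	uint64_t idx = file_off / SKETCH_BLOCK_SIZE;
	uint64_t block;

	if (file_off >= inode->size || idx >= SKETCH_MAX_MAPPINGS)
		return SKETCH_NO_BLOCK;

	block = maps[inode->mappings_idx].block[idx];
	if (block == SKETCH_NO_BLOCK)
		return SKETCH_NO_BLOCK;

	return block * SKETCH_BLOCK_SIZE + file_off % SKETCH_BLOCK_SIZE;
}
```

Using indices instead of pointers is a design choice of this sketch: it
keeps the metadata position-independent, which is one plausible way to
make it easy to hand over.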

There are additional constraints/requirements which guestmemfs aims to
meet:

1. Secret hiding: all filesystem data is removed from the kernel direct
map so immune from speculative access. read()/write() are not supported;
the only way to get at the data is via mmap.

2. Struct page overhead elimination: the memory is not managed by the
buddy allocator and hence has no struct pages.

3. PMD and PUD level allocations for TLB performance: guestmemfs
allocates PMD-sized pages to back files which improves TLB perf (caveat
below!). PUD size allocations are a next step.

4. Device assignment: being able to use guestmemfs memory for
VFIO/iommufd mappings, and allow those mappings to survive and continue
to be used across kexec.
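
To put a number on requirement 2, the sketch below estimates how much
memory struct pages would consume if a region were managed at 4 KiB
granularity. The 64-byte struct page size is an assumption (it is the
typical size on x86-64, but it depends on kernel configuration), and the
helper name is hypothetical.

```c
#include <stdint.h>

#define SKETCH_BASE_PAGE_SIZE   UINT64_C(4096) /* 4 KiB base page */
#define SKETCH_STRUCT_PAGE_SIZE UINT64_C(64)   /* assumed sizeof(struct page) */

/* Bytes of struct page metadata needed to cover region_bytes of RAM. */
uint64_t sketch_struct_page_bytes(uint64_t region_bytes)
{
	return (region_bytes / SKETCH_BASE_PAGE_SIZE) * SKETCH_STRUCT_PAGE_SIZE;
}
```

Under these assumptions the overhead is 64/4096, roughly 1.6% of the
region: for 1 TiB of guest memory that is 16 GiB of struct pages.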


Next steps
==========

The idea is that this patch series implements a minimal filesystem to
provide the foundations for in-memory files that persist across kexec.
Once this foundation is in place it will be extended:

1. Improve the filesystem to be more comprehensive - currently it's just
functional enough to demonstrate the main objective of reserved memory
and persistence via KHO.

2. Build support for iommufd IOAS and HWPT persistence, and integrate
that with guestmemfs. The idea is that if VMs have DMA devices assigned
to them, DMA should continue running across kexec. A future patch series
will add support for this in iommufd and connect iommufd to guestmemfs
so that guestmemfs files can remain mapped into the IOMMU during kexec.

3. Support a guest_memfd interface to files so that they can be used for
confidential computing without needing to mmap into userspace.

4. Gigantic PUD level mappings for even better TLB perf.

Caveats
=======

There are a few issues with the current implementation which should be
solved either in this patch series or soon in follow-on work:

1. Although PMD-size allocations are done, PTE-level page tables are
still created. This is because guestmemfs uses remap_pfn_range() to set
up userspace pgtables. Currently remap_pfn_range() only creates
PTE-level mappings. I suggest enhancing remap_pfn_range() to support
creating higher level mappings where possible, by adding pmd_special
and pud_special flags.

2. NUMA support is currently non-existent. To make this more generally
useful it's necessary to have NUMA-awareness. One thought on how to do
this is to be able to specify multiple allocations with NUMA affinity
on the kernel cmdline and have multiple mount points, one per NUMA node.
Currently, for simplicity, only a single contiguous filesystem data
allocation and a single mount point is supported.

3. MCEs are currently not handled - we need to add functionality for
this to be able to track block ownership and deliver an MCE correctly.

4. Looking for reviews from filesystem experts to see if necessary
callbacks, refcounting, locking, etc, are done correctly.
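
To illustrate why caveat 1 matters, the sketch below counts how many
page table entries are needed to map a region at each level, assuming
x86-64 sizes (4 KiB PTE, 2 MiB PMD, 1 GiB PUD). The enum and helper
names are hypothetical and only serve this example.

```c
#include <stdint.h>

enum sketch_level { SKETCH_PTE, SKETCH_PMD, SKETCH_PUD };

/* Bytes mapped by a single entry at the given level (x86-64 assumed). */
uint64_t sketch_level_size(enum sketch_level level)
{
	switch (level) {
	case SKETCH_PMD:
		return UINT64_C(2) << 20; /* 2 MiB */
	case SKETCH_PUD:
		return UINT64_C(1) << 30; /* 1 GiB */
	case SKETCH_PTE:
	default:
		return UINT64_C(4) << 10; /* 4 KiB */
	}
}

/* Leaf entries needed to map len bytes at the given level, rounded up. */
uint64_t sketch_entries_needed(uint64_t len, enum sketch_level level)
{
	uint64_t size = sketch_level_size(level);

	return (len + size - 1) / size;
}
```

For a 1 GiB file this gives 262144 PTE-level entries, 512 PMD-level
entries, or a single PUD-level entry, which is the gap that PMD/PUD
support in remap_pfn_range() would close.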

Open questions
==============

It is not too clear if or how guestmemfs should use DAX as a source of
memory. Seeing as guestmemfs has an in-memory design, it seems that it
is not necessary to use DAX as a source of memory, but I am keen for
guidance/input on whether DAX should be used here.

The filesystem data memory is removed from the direct map for secret
hiding, but it is still necessary to mmap it to be accessible to KVM.
For improving secret hiding even more a guest_memfd-style interface
could be used to remove the need to mmap. That introduces a new problem
of the memory being completely inaccessible to KVM for things like MMIO
instruction emulation. How can this be handled?

Related Work
============

There are similarities to a few attempts at solving aspects of this
problem previously.

The original was probably PKRAM from Oracle; a tmpfs filesystem with
persistence:
https://lore.kernel.org/kexec/1682554137-13938-1-git-send-email-anthony.yznaga@oracle.com/
guestmemfs will additionally provide secret hiding, PMD/PUD allocations
and a path to DMA persistence and NUMA support.

Dmemfs from Tencent aimed to remove the need for struct page overhead:
https://lore.kernel.org/kvm/cover.1602093760.git.yuleixzhang@tencent.com/
Guestmemfs provides this benefit too, along with persistence across
kexec and secret hiding. 

Pkernfs attempted to solve guest memory persistence and IOMMU
persistence all in one:
https://lore.kernel.org/all/20240205120203.60312-1-jgowans@amazon.com/
Guestmemfs is a re-work of that to only persist guest RAM in the
filesystem, and to use KHO for filesystem metadata. IOMMU persistence
will be implemented independently with persistent iommufd domains via
KHO.

Testing
=======

The testing for this can be seen in the Documentation file in this patch
series. Essentially it is using a guestmemfs file for a QEMU VM's RAM,
doing a kexec, restoring the QEMU VM and confirming that the VM picked
up from where it left off.
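
Since read()/write() are not supported, a VMM can only reach file data
through mmap. The following is a minimal userspace sketch of that access
pattern, assuming that creating a file with open(O_CREAT) and sizing it
with ftruncate() behave as on other filesystems; the series lists
truncation and mmap support, but the exact userspace contract is defined
by its Documentation patch, which is not reproduced here. The helper
name and any path passed to it are hypothetical.

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Create (or reopen) a file, size it to len bytes and map it shared.
 * Returns the mapping, or NULL on any failure.
 */
void *sketch_map_guest_ram(const char *path, size_t len)
{
	void *mem;
	int fd = open(path, O_RDWR | O_CREAT, 0600);

	if (fd < 0)
		return NULL;

	if (ftruncate(fd, (off_t)len) < 0) {
		close(fd);
		return NULL;
	}

	mem = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	close(fd); /* the mapping stays valid after the fd is closed */

	return mem == MAP_FAILED ? NULL : mem;
}
```

After kexec, a new VMM process would call the same helper on the same
path to get the preserved contents back; a length that is a multiple of
2 MiB matches the PMD-sized blocks the filesystem allocates.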

James Gowans (10):
  guestmemfs: Introduce filesystem skeleton
  guestmemfs: add inode store, files and dirs
  guestmemfs: add persistent data block allocator
  guestmemfs: support file truncation
  guestmemfs: add file mmap callback
  kexec/kho: Add addr flag to not initialise memory
  guestmemfs: Persist filesystem metadata via KHO
  guestmemfs: Block modifications when serialised
  guestmemfs: Add documentation and usage instructions
  MAINTAINERS: Add maintainers for guestmemfs

 Documentation/filesystems/guestmemfs.rst |  87 +++++++
 MAINTAINERS                              |   8 +
 arch/x86/mm/init_64.c                    |   2 +
 fs/Kconfig                               |   1 +
 fs/Makefile                              |   1 +
 fs/guestmemfs/Kconfig                    |  11 +
 fs/guestmemfs/Makefile                   |   8 +
 fs/guestmemfs/allocator.c                |  40 +++
 fs/guestmemfs/dir.c                      |  43 ++++
 fs/guestmemfs/file.c                     | 106 ++++++++
 fs/guestmemfs/guestmemfs.c               | 160 ++++++++++++
 fs/guestmemfs/guestmemfs.h               |  60 +++++
 fs/guestmemfs/inode.c                    | 189 ++++++++++++++
 fs/guestmemfs/serialise.c                | 302 +++++++++++++++++++++++
 include/linux/guestmemfs.h               |  16 ++
 include/uapi/linux/kexec.h               |   6 +
 kernel/kexec_kho_in.c                    |  12 +-
 kernel/kexec_kho_out.c                   |   4 +
 18 files changed, 1055 insertions(+), 1 deletion(-)
 create mode 100644 Documentation/filesystems/guestmemfs.rst
 create mode 100644 fs/guestmemfs/Kconfig
 create mode 100644 fs/guestmemfs/Makefile
 create mode 100644 fs/guestmemfs/allocator.c
 create mode 100644 fs/guestmemfs/dir.c
 create mode 100644 fs/guestmemfs/file.c
 create mode 100644 fs/guestmemfs/guestmemfs.c
 create mode 100644 fs/guestmemfs/guestmemfs.h
 create mode 100644 fs/guestmemfs/inode.c
 create mode 100644 fs/guestmemfs/serialise.c
 create mode 100644 include/linux/guestmemfs.h

Comments

Theodore Ts'o Aug. 5, 2024, 2:32 p.m. UTC | #1
On Mon, Aug 05, 2024 at 11:32:35AM +0200, James Gowans wrote:
> Guestmemfs implements preservation across kexec by carving out a
> large contiguous block of host system RAM early in boot which is
> then used as the data for the guestmemfs files.

Why does the memory have to be (a) contiguous, and (b) carved out of
*host* system memory early in boot?  This seems to be very inflexible;
it means that you have to know how much memory will be needed for
guestmemfs in early boot.

Also, the VMM update process is not a common case thing, so we don't
need to optimize for performance.  If we need to temporarily use
swap/zswap to allocate memory at VMM update time, and if the pages
aren't contiguous when they are copied out before doing the VMM
update, that might be very well worth the vast amount of memory needed to
pay for reserving memory on the host for the VMM update that only
might happen once every few days/weeks/months (depending on whether
you are doing update just for high severity security fixes, or for
random VMM updates).

Even if you are updating the VMM every few days, it still doesn't seem
that permanently reserving contiguous memory on the host can be
justified from a TCO perspective.

Cheers,

						- Ted

Paolo Bonzini Aug. 5, 2024, 2:41 p.m. UTC | #2
On Mon, Aug 5, 2024 at 4:35 PM Theodore Ts'o <tytso@mit.edu> wrote:
> On Mon, Aug 05, 2024 at 11:32:35AM +0200, James Gowans wrote:
> > Guestmemfs implements preservation across kexec by carving out a
> > large contiguous block of host system RAM early in boot which is
> > then used as the data for the guestmemfs files.
>
> Also, the VMM update process is not a common case thing, so we don't
> need to optimize for performance.  If we need to temporarily use
> swap/zswap to allocate memory at VMM update time, and if the pages
> aren't contiguous when they are copied out before doing the VMM
> update

I'm not sure I understand, where would this temporary allocation happen?

> that might be very well worth the vast amount of memory needed to
> pay for reserving memory on the host for the VMM update that only
> might happen once every few days/weeks/months (depending on whether
> you are doing update just for high severity security fixes, or for
> random VMM updates).
>
> Even if you are updating the VMM every few days, it still doesn't seem
> that permanently reserving contiguous memory on the host can be
> justified from a TCO perspective.

As far as I understand, this is intended for use in systems that do
not do anything except hosting VMs, where anyway you'd devote 90%+ of
host memory to hugetlbfs gigapages.

Paolo

James Gowans Aug. 5, 2024, 7:47 p.m. UTC | #3
On Mon, 2024-08-05 at 16:41 +0200, Paolo Bonzini wrote:
> On Mon, Aug 5, 2024 at 4:35 PM Theodore Ts'o <tytso@mit.edu> wrote:
> > On Mon, Aug 05, 2024 at 11:32:35AM +0200, James Gowans wrote:
> > > Guestmemfs implements preservation across kexec by carving out a
> > > large contiguous block of host system RAM early in boot which is
> > > then used as the data for the guestmemfs files.
> > 
> > Also, the VMM update process is not a common case thing, so we don't
> > need to optimize for performance.  If we need to temporarily use
> > swap/zswap to allocate memory at VMM update time, and if the pages
> > aren't contiguous when they are copied out before doing the VMM
> > update
> 
> I'm not sure I understand, where would this temporary allocation happen?

The intended use case for live update is to update the entirety of the
hypervisor: kexecing into a new kernel, launching new VMM processes. So
anything in kernel state (page tables, VMAs, (z)swap entries, struct
pages, etc) is all lost after kexec and needs to be re-created. That's
the job of guestmemfs: provide the persistence across kexec and ability
to re-create the mapping by re-opening the files.

It would be far too impactful to need to write out the whole VM memory
to disk. Also with CoCo VMs that's not really possible. When virtual
machines are running, every millisecond of down time counts. It would be
wasteful to need to keep terabytes of SSDs lying around just to briefly
write all the guest RAM there and then read it out a moment later. Much
better to leave all the guest memory where it is: in memory.

> 
> > that might be very well worth the vast amount of memory needed to
> > pay for reserving memory on the host for the VMM update that only
> > might happen once every few days/weeks/months (depending on whether
> > you are doing update just for high severity security fixes, or for
> > random VMM updates).
> > 
> > Even if you are updating the VMM every few days, it still doesn't seem
> > that permanently reserving contiguous memory on the host can be
> > justified from a TCO perspective.
> 
> As far as I understand, this is intended for use in systems that do
> not do anything except hosting VMs, where anyway you'd devote 90%+ of
> host memory to hugetlbfs gigapages.

Exactly, the use case here is for machines whose only job is to be a KVM
hypervisor. The majority of system RAM is donated to guestmemfs;
anything else (host kernel memory and VMM anonymous memory) is
essentially overhead and should be minimised.

JG

James Gowans Aug. 5, 2024, 7:53 p.m. UTC | #4
On Mon, 2024-08-05 at 10:32 -0400, Theodore Ts'o wrote:
> On Mon, Aug 05, 2024 at 11:32:35AM +0200, James Gowans wrote:
> > Guestmemfs implements preservation across kexec by carving out a
> > large contiguous block of host system RAM early in boot which is
> > then used as the data for the guestmemfs files.
> 
> Why does the memory have to be (a) contiguous, and (b) carved out of
> *host* system memory early in boot?  This seems to be very inflexible;
> it means that you have to know how much memory will be needed for
> guestmemfs in early boot.

The main reason for both of these is to guarantee that the huge (2 MiB
PMD) and gigantic (1 GiB PUD) allocations can happen. While this patch
series only does huge page allocations for simplicity, the intention is
to extend it to gigantic PUD level allocations soon (I'd like to get the
simple functionality merged before adding more complexity).
Other than doing a memblock allocation at early boot there really is no
way that I know of to do GiB-size allocations dynamically.

In terms of the need for a contiguous chunk, that's a bit of a
simplification for now. As mentioned in the cover letter there currently
isn't any NUMA support in this patch series. We'd want to add the
ability to do NUMA handling in following patch series. In that case it
would be multiple contiguous allocations, one for each NUMA node that
the user wants to run VMs on.

JG

Jan Kara Aug. 5, 2024, 8:01 p.m. UTC | #5
On Mon 05-08-24 11:32:35, James Gowans wrote:
> In this patch series a new in-memory filesystem designed specifically
> for live update is implemented. Live update is a mechanism to support
> updating a hypervisor in a way that has limited impact to running
> virtual machines. This is done by pausing/serialising running VMs,
> kexec-ing into a new kernel, starting new VMM processes and then
> deserialising/resuming the VMs so that they continue running from where
> they were. To support this, guest memory needs to be preserved.
> 
> Guestmemfs implements preservation across kexec by carving out a large
> contiguous block of host system RAM early in boot which is then used as
> the data for the guestmemfs files. As well as preserving that large
> block of data memory across kexec, the filesystem metadata is preserved
> via the Kexec Hand Over (KHO) framework (still under review):
> https://lore.kernel.org/all/20240117144704.602-1-graf@amazon.com/
> 
> Filesystem metadata is structured to make preservation across kexec
> easy: inodes are one large contiguous array, and each inode has a
> "mappings" block which defines which block from the filesystem data
> memory corresponds to which offset in the file.
> 
> There are additional constraints/requirements which guestmemfs aims to
> meet:
> 
> 1. Secret hiding: all filesystem data is removed from the kernel direct
> map so immune from speculative access. read()/write() are not supported;
> the only way to get at the data is via mmap.
> 
> 2. Struct page overhead elimination: the memory is not managed by the
> buddy allocator and hence has no struct pages.
> 
> 3. PMD and PUD level allocations for TLB performance: guestmemfs
> allocates PMD-sized pages to back files which improves TLB perf (caveat
> below!). PUD size allocations are a next step.
> 
> 4. Device assignment: being able to use guestmemfs memory for
> VFIO/iommufd mappings, and allow those mappings to survive and continue
> to be used across kexec.
 
To me the basic functionality resembles hugetlbfs a lot. Now I know very
few details about hugetlbfs so I've added relevant folks to CC. Have you
considered extending hugetlbfs with the functionality you need (such as
preservation across kexec) instead of implementing a completely new filesystem?

								Honza
 
Jason Gunthorpe Aug. 5, 2024, 11:29 p.m. UTC | #6
On Mon, Aug 05, 2024 at 10:01:51PM +0200, Jan Kara wrote:

> > 4. Device assignment: being able to use guestmemfs memory for
> > VFIO/iommufd mappings, and allow those mappings to survive and continue
> > to be used across kexec.

That's a fun one. Proposals for that will be very interesting!

> To me the basic functionality resembles hugetlbfs a lot. Now I know very
> few details about hugetlbfs so I've added relevant folks to CC. Have you
> considered extending hugetlbfs with the functionality you need (such as
> preservation across kexec) instead of implementing a completely new filesystem?

In mm circles we've broadly been talking about splitting the "memory
provider" part out of hugetlbfs into its own layer. This would include
the carving out of kernel memory at boot and organizing it by page
size to allow huge ptes.

It would make a lot of sense to have only one carve out mechanism, and
several consumers - hugetlbfs, the new private guestmemfd, this thing,
for example.

Jason

James Gowans Aug. 6, 2024, 8:12 a.m. UTC | #7
On Mon, 2024-08-05 at 22:01 +0200, Jan Kara wrote:
> 
> On Mon 05-08-24 11:32:35, James Gowans wrote:
> > In this patch series a new in-memory filesystem designed specifically
> > for live update is implemented. Live update is a mechanism to support
> > updating a hypervisor in a way that has limited impact to running
> > virtual machines. This is done by pausing/serialising running VMs,
> > kexec-ing into a new kernel, starting new VMM processes and then
> > deserialising/resuming the VMs so that they continue running from where
> > they were. To support this, guest memory needs to be preserved.
> > 
> > Guestmemfs implements preservation across kexec by carving out a large
> > contiguous block of host system RAM early in boot which is then used as
> > the data for the guestmemfs files. As well as preserving that large
> > block of data memory across kexec, the filesystem metadata is preserved
> > via the Kexec Hand Over (KHO) framework (still under review):
> > https://lore.kernel.org/all/20240117144704.602-1-graf@amazon.com/
> > 
> > Filesystem metadata is structured to make preservation across kexec
> > easy: inodes are one large contiguous array, and each inode has a
> > "mappings" block which defines which block from the filesystem data
> > memory corresponds to which offset in the file.
> > 
> > There are additional constraints/requirements which guestmemfs aims to
> > meet:
> > 
> > 1. Secret hiding: all filesystem data is removed from the kernel direct
> > map so immune from speculative access. read()/write() are not supported;
> > the only way to get at the data is via mmap.
> > 
> > 2. Struct page overhead elimination: the memory is not managed by the
> > buddy allocator and hence has no struct pages.
> > 
> > 3. PMD and PUD level allocations for TLB performance: guestmemfs
> > allocates PMD-sized pages to back files which improves TLB perf (caveat
> > below!). PUD size allocations are a next step.
> > 
> > 4. Device assignment: being able to use guestmemfs memory for
> > VFIO/iommufd mappings, and allow those mappings to survive and continue
> > to be used across kexec.
> 
> To me the basic functionality resembles hugetlbfs a lot. Now I know very
> few details about hugetlbfs so I've added relevant folks to CC. Have you
> considered extending hugetlbfs with the functionality you need (such as
> preservation across kexec) instead of implementing a completely new filesystem?

Oof, I forgot to mention hugetlbfs in the cover letter - thanks for
raising this! Indeed, there are similarities: in-memory fs, with
huge/gigantic allocations.

We did consider extending hugetlbfs to support persistence, but there
are differences in requirements which we're not sure would be practical
or desirable to add to hugetlbfs.

1. Secret hiding: with guestmemfs all of the memory is out of the kernel
direct map as an additional defence mechanism. This means no
read()/write() syscalls to guestmemfs files, and no IO to it. The only
way to access it is to mmap the file.

2. No struct page overhead: the intended use case is for systems whose
sole job is to be a hypervisor, typically for large (multi-GiB) VMs, so
the majority of system RAM would be donated to this fs. We definitely
don't want 4 KiB struct pages here as it would be a significant
overhead. That's why guestmemfs carves the memory out in early boot and
sets memblock flags to avoid struct page allocation. I don't know if
hugetlbfs does anything fancy to avoid allocating PTE-level struct pages
for its memory?

3. guest_memfd interface: For confidential computing use-cases we need
to provide a guest_memfd style interface so that these FDs can be used
as a guest_memfd file in KVM memslots. Would there be interest in
extending hugetlbfs to also support a guest_memfd style interface?

4. Metadata designed for persistence: guestmemfs will need to keep
simple internal metadata data structures (limited allocations, limited
fragmentation) so that pages can easily and efficiently be marked as
persistent via KHO. Something like slab allocations would probably be a
no-go as then we'd need to persist and reconstruct the slab allocator. I
don't know how hugetlbfs structures its fs metadata but I'm guessing it
uses the slab and does lots of small allocations so trying to retrofit
persistence via KHO to it may be challenging.

5. Integration with persistent IOMMU mappings: to keep DMA running
across kexec, iommufd needs to know that the backing memory for an IOAS
is persistent too. The idea is to do some DMA pinning of persistent
files, which would require iommufd/guestmemfs integration - would we
want to add this to hugetlbfs?

6. Virtualisation-specific APIs: starting to get a bit esoteric here,
but use-cases like being able to carve out specific chunks of memory
from a running VM and turn it into memory for another side car VM, or
doing post-copy LM via DMA by mapping memory into the IOMMU but taking
page faults on the CPU. This may require virtualisation-specific ioctls
on the files which wouldn't be generally applicable to hugetlbfs.

7. NUMA control: a requirement is to always have correct NUMA affinity.
While currently not implemented the idea is to extend the guestmemfs
allocation to support specifying allocation sizes from each NUMA node at
early boot, and then having multiple mount points, one per NUMA node (or
something like that...). Unclear if this is something hugetlbfs would
want.

There are probably more potential issues, but those are the ones that
come to mind... That being said, if hugetlbfs maintainers are interested
in going in this direction then we can definitely look at enhancing
hugetlbfs.

I think there are two types of problems: "Would hugetlbfs want this
functionality?" - that's the majority. And a few are "This would be hard
with hugetlbfs!" - persistence probably falls into this category.

Looking forward to input from maintainers. :-)

JG

James Gowans Aug. 6, 2024, 8:26 a.m. UTC | #8
On Mon, 2024-08-05 at 20:29 -0300, Jason Gunthorpe wrote:
> 
> On Mon, Aug 05, 2024 at 10:01:51PM +0200, Jan Kara wrote:
> 
> > > 4. Device assignment: being able to use guestmemfs memory for
> > > VFIO/iommufd mappings, and allow those mappings to survive and continue
> > > to be used across kexec.
> 
> That's a fun one. Proposals for that will be very interesting!

Yup! We have an LPC session for this; looking forward to discussing more
there: https://lpc.events/event/18/contributions/1686/
I'll be working on an iommufd RFC soon; should get it out before then.

> 
> > To me the basic functionality resembles hugetlbfs a lot. Now I know very
> > few details about hugetlbfs so I've added relevant folks to CC. Have you
> > considered extending hugetlbfs with the functionality you need (such as
> > preservation across kexec) instead of implementing a completely new filesystem?
> 
> In mm circles we've broadly been talking about splitting the "memory
> provider" part out of hugetlbfs into its own layer. This would include
> the carving out of kernel memory at boot and organizing it by page
> size to allow huge ptes.
> 
> It would make a lot of sense to have only one carve out mechanism, and
> several consumers - hugetlbfs, the new private guestmemfd, this thing,
> for example.

The actual allocation in guestmemfs isn't too complex: basically just a
hook in mem_init() (that's a bit yucky as it's arch-specific) and then a
call to the memblock allocator.
That being said, the functionality for this patch series is currently
intentionally limited: it is missing NUMA support and only does PMD (2 MiB)
block allocations for files. We want PUD (1 GiB) where possible, falling
back to splitting into 2 MiB blocks for smaller files. That will complicate
things, so perhaps a memory provider will be useful when this gets more
functionally complete. Keen to hear more!
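
To make the PUD-with-PMD-fallback idea concrete, here is a rough sketch of
how a file size might be split into 1 GiB and 2 MiB blocks. This is not
what the series does today (it only allocates PMD-sized blocks); the
function name plan_blocks and the "fill with PUD blocks first, round the
tail up to PMD blocks" policy are assumptions made purely for illustration.

```python
# Hypothetical policy sketch; not taken from the guestmemfs patches.
PMD_SIZE = 2 * 1024**2               # 2 MiB block
PUD_SIZE = 1024**3                   # 1 GiB block
PMDS_PER_PUD = PUD_SIZE // PMD_SIZE  # 512


def plan_blocks(file_size: int) -> tuple[int, int]:
    """Return (pud_blocks, pmd_blocks) needed to back file_size bytes."""
    if file_size < 0:
        raise ValueError("file_size must be non-negative")
    # Use as many 1 GiB blocks as fit entirely inside the file.
    pud_blocks, remainder = divmod(file_size, PUD_SIZE)
    # Round the tail up to whole 2 MiB blocks (ceiling division).
    pmd_blocks = -(-remainder // PMD_SIZE)
    # If rounding fills a whole 1 GiB, promote the tail to one PUD block.
    if pmd_blocks == PMDS_PER_PUD:
        pud_blocks, pmd_blocks = pud_blocks + 1, 0
    return pud_blocks, pmd_blocks


if __name__ == "__main__":
    for size in (PMD_SIZE, PUD_SIZE + 3 * PMD_SIZE, 5 * PUD_SIZE):
        print(size, plan_blocks(size))
```

A real allocator would also have to cope with the case where no free
1 GiB-aligned region is left and fall back to 2 MiB blocks for the whole
file, which is part of the complication mentioned above.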

JG
David Hildenbrand Aug. 6, 2024, 1:43 p.m. UTC | #9
> 1. Secret hiding: with guestmemfs all of the memory is out of the kernel
> direct map as an additional defence mechanism. This means no
> read()/write() syscalls to guestmemfs files, and no IO to it. The only
> way to access it is to mmap the file.

There are people interested in similar things for guest_memfd.

> 
> 2. No struct page overhead: the intended use case is for systems whose
> sole job is to be a hypervisor, typically for large (multi-GiB) VMs, so
> the majority of system RAM would be donated to this fs. We definitely
> don't want 4 KiB struct pages here as it would be a significant
> overhead. That's why guestmemfs carves the memory out in early boot and
> sets memblock flags to avoid struct page allocation. I don't know if
> hugetlbfs does anything fancy to avoid allocating PTE-level struct pages
> for its memory?

Sure, it's called HVO and can optimize out a significant portion of the 
vmemmap.
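
For a sense of scale, the struct page overhead under discussion can be
estimated with some quick arithmetic. The sketch below assumes a 64-byte
struct page and 4 KiB base pages, which is typical for x86-64 but is an
assumption here, not something stated in the thread; the helper name
struct_page_overhead is made up for illustration. It shows only the
unoptimized cost, i.e. before HVO (HugeTLB Vmemmap Optimization) frees
part of the vmemmap.

```python
# Back-of-the-envelope estimate; constants are assumptions, see above.
PAGE_SIZE = 4096          # 4 KiB base page
STRUCT_PAGE_SIZE = 64     # assumed sizeof(struct page)
GIB = 1024**3
TIB = 1024**4


def struct_page_overhead(mem_bytes: int,
                         page_size: int = PAGE_SIZE,
                         struct_page_size: int = STRUCT_PAGE_SIZE) -> int:
    """Bytes of struct page metadata needed to describe mem_bytes of RAM."""
    return (mem_bytes // page_size) * struct_page_size


if __name__ == "__main__":
    overhead = struct_page_overhead(TIB)
    # 2**28 pages * 64 bytes = 16 GiB, i.e. 64/4096 = 1.5625% of the memory.
    print(f"{overhead // GIB} GiB of struct pages per TiB of RAM")
```

Under those assumptions a host donating 1 TiB to guests would spend about
16 GiB on struct pages, which is the cost both approaches try to avoid.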

> 
> 3. guest_memfd interface: For confidential computing use-cases we need
> to provide a guest_memfd style interface so that these FDs can be used
> as a guest_memfd file in KVM memslots. Would there be interest in
> extending hugetlbfs to also support a guest_memfd style interface?
> 

"Extending hugetlbfs" sounds wrong; hugetlbfs is a blast from the past 
and not something people are particularly keen to extend for such use 
cases. :)

Instead, as Jason said, we're looking into letting guest_memfd own and 
manage large chunks of contiguous memory.

> 4. Metadata designed for persistence: guestmemfs will need to keep
> simple internal metadata structures (limited allocations, limited
> fragmentation) so that pages can easily and efficiently be marked as
> persistent via KHO. Something like slab allocations would probably be a
> no-go as then we'd need to persist and reconstruct the slab allocator. I
> don't know how hugetlbfs structures its fs metadata but I'm guessing it
> uses the slab and does lots of small allocations so trying to retrofit
> persistence via KHO to it may be challenging.
> 
> 5. Integration with persistent IOMMU mappings: to keep DMA running
> across kexec, iommufd needs to know that the backing memory for an IOAS
> is persistent too. The idea is to do some DMA pinning of persistent
> files, which would require iommufd/guestmemfs integration - would we
> want to add this to hugetlbfs?
> 
> 6. Virtualisation-specific APIs: starting to get a bit esoteric here,
> but use-cases like being able to carve out specific chunks of memory
> from a running VM and turn it into memory for another side car VM, or
> doing post-copy LM via DMA by mapping memory into the IOMMU but taking
> page faults on the CPU. This may require virtualisation-specific ioctls
> on the files which wouldn't be generally applicable to hugetlbfs.
> 
> 7. NUMA control: a requirement is to always have correct NUMA affinity.
> While currently not implemented the idea is to extend the guestmemfs
> allocation to support specifying allocation sizes from each NUMA node at
> early boot, and then having multiple mount points, one per NUMA node (or
> something like that...). Unclear if this is something hugetlbfs would
> want.
> 
> There are probably more potential issues, but those are the ones that
> come to mind... That being said, if hugetlbfs maintainers are interested
> in going in this direction then we can definitely look at enhancing
> hugetlbfs.
> 
> I think there are two types of problems: "Would hugetlbfs want this
> functionality?" - that's the majority. And a few are "This would be hard
> with hugetlbfs!" - persistence probably falls into this category.

I'd rather ask whether you should instead teach/extend the
guest_memfd concept with some of what you propose here.

At least "guest_memfd" sounds a lot like the "anonymous fd" based 
variant of guestmemfs ;)

Like we have hugetlbfs and memfd with hugetlb pages.