diff mbox series

[2/2] vfio/helpers: Align mmaps

Message ID 20241022200830.4129598-3-alex.williamson@redhat.com (mailing list archive)
State New
Series vfio: Align mmaps

Commit Message

Alex Williamson Oct. 22, 2024, 8:08 p.m. UTC
Thanks to work by Peter Xu, support is introduced in Linux v6.12 to
allow pfnmap insertions at PMD and PUD levels of the page table.  This
means that provided a properly aligned mmap, the vfio driver is able
to map MMIO at significantly larger intervals than PAGE_SIZE.  For
example on x86_64 (the only architecture currently supporting huge
pfnmaps for PUD), rather than 4KiB mappings, we can map device MMIO
using 2MiB and even 1GiB page table entries.

Typically mmap will already provide PMD aligned mappings, so devices
with moderately sized MMIO ranges, even GPUs with standard 256MiB BARs,
will already take advantage of this support.  However in order to better
support devices exposing multi-GiB MMIO, such as 3D accelerators or GPUs
with resizable BARs enabled, we need to manually align the mmap.

There doesn't seem to be a way for userspace to easily learn about PMD
and PUD mapping level sizes, therefore this takes the simple approach
of aligning the mapping to the largest power of two that evenly divides
the region size (the size itself for power-of-two regions), up to 1GiB,
which is currently the maximum alignment we care about.

Cc: Peter Xu <peterx@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
---
 hw/vfio/helpers.c | 32 ++++++++++++++++++++++++++++++--
 1 file changed, 30 insertions(+), 2 deletions(-)
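The alignment rule described in the commit message can be sketched in isolation as below. This is a minimal illustration, not QEMU code: `mmap_align_for_size()` and `ALIGN_CAP` are hypothetical names, and the lowest-set-bit expression stands in for the `1ULL << ctz64(size)` used by the patch (the two are equivalent for nonzero sizes).

```c
#include <assert.h>
#include <stdint.h>

/* 1GiB, the cap on alignment chosen by the patch. */
#define ALIGN_CAP (UINT64_C(1) << 30)

/*
 * Hypothetical helper: return the largest power of two that evenly
 * divides size, capped at 1GiB.  size must be nonzero, since zero has
 * no lowest set bit.
 */
static uint64_t mmap_align_for_size(uint64_t size)
{
    uint64_t lowest_bit;

    assert(size != 0);

    /* Two's-complement trick: isolate the lowest set bit of size. */
    lowest_bit = size & (~size + 1);

    return lowest_bit < ALIGN_CAP ? lowest_bit : ALIGN_CAP;
}
```

Under these assumptions a 256MiB region yields a 256MiB alignment, a 16GiB region is capped at 1GiB, and a non-power-of-two size such as 6MiB yields 2MiB.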

Comments

Peter Xu Oct. 22, 2024, 9:50 p.m. UTC | #1
On Tue, Oct 22, 2024 at 02:08:29PM -0600, Alex Williamson wrote:
> Thanks to work by Peter Xu, support is introduced in Linux v6.12 to
> allow pfnmap insertions at PMD and PUD levels of the page table.  This
> means that provided a properly aligned mmap, the vfio driver is able
> to map MMIO at significantly larger intervals than PAGE_SIZE.  For
> example on x86_64 (the only architecture currently supporting huge
> pfnmaps for PUD), rather than 4KiB mappings, we can map device MMIO
> using 2MiB and even 1GiB page table entries.
> 
> Typically mmap will already provide PMD aligned mappings, so devices
> with moderately sized MMIO ranges, even GPUs with standard 256MiB BARs,
> will already take advantage of this support.  However in order to better
> support devices exposing multi-GiB MMIO, such as 3D accelerators or GPUs
> with resizable BARs enabled, we need to manually align the mmap.
> 
> There doesn't seem to be a way for userspace to easily learn about PMD
> and PUD mapping level sizes, therefore this takes the simple approach
> to align the mapping to the power-of-two size of the region, up to 1GiB,
> which is currently the maximum alignment we care about.
> 
> Cc: Peter Xu <peterx@redhat.com>
> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>

For the longer term, maybe QEMU can provide a function to reserve an
mmap range with a specific alignment requirement.  For example,
qemu_ram_mmap() currently does mostly the same thing (it also hides a
ppc-only hugetlb fix from 7197fb4058, which isn't a concern here).  The
complexity could then be hidden in that function.  This is only a
comment for the future.

Reviewed-by: Peter Xu <peterx@redhat.com>

Thanks!

Cédric Le Goater Oct. 23, 2024, 12:44 p.m. UTC | #2
On 10/22/24 22:08, Alex Williamson wrote:
> Thanks to work by Peter Xu, support is introduced in Linux v6.12 to
> allow pfnmap insertions at PMD and PUD levels of the page table.  This
> means that provided a properly aligned mmap, the vfio driver is able
> to map MMIO at significantly larger intervals than PAGE_SIZE.  For
> example on x86_64 (the only architecture currently supporting huge
> pfnmaps for PUD), rather than 4KiB mappings, we can map device MMIO
> using 2MiB and even 1GiB page table entries.
> 
> Typically mmap will already provide PMD aligned mappings, so devices
> with moderately sized MMIO ranges, even GPUs with standard 256MiB BARs,
> will already take advantage of this support.  However in order to better
> support devices exposing multi-GiB MMIO, such as 3D accelerators or GPUs
> with resizable BARs enabled, we need to manually align the mmap.
> 
> There doesn't seem to be a way for userspace to easily learn about PMD
> and PUD mapping level sizes, therefore this takes the simple approach
> to align the mapping to the power-of-two size of the region, up to 1GiB,
> which is currently the maximum alignment we care about.


Couldn't we inspect /sys/kernel/mm/hugepages/ to get the sizes?


> Cc: Peter Xu <peterx@redhat.com>
> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>

anyhow,


Reviewed-by: Cédric Le Goater <clg@redhat.com>

Thanks,

C.



Alex Williamson Oct. 23, 2024, 1:55 p.m. UTC | #3
On Wed, 23 Oct 2024 14:44:19 +0200
Cédric Le Goater <clg@redhat.com> wrote:

> On 10/22/24 22:08, Alex Williamson wrote:
> > Thanks to work by Peter Xu, support is introduced in Linux v6.12 to
> > allow pfnmap insertions at PMD and PUD levels of the page table.  This
> > means that provided a properly aligned mmap, the vfio driver is able
> > to map MMIO at significantly larger intervals than PAGE_SIZE.  For
> > example on x86_64 (the only architecture currently supporting huge
> > pfnmaps for PUD), rather than 4KiB mappings, we can map device MMIO
> > using 2MiB and even 1GiB page table entries.
> > 
> > Typically mmap will already provide PMD aligned mappings, so devices
> > with moderately sized MMIO ranges, even GPUs with standard 256MiB BARs,
> > will already take advantage of this support.  However in order to better
> > support devices exposing multi-GiB MMIO, such as 3D accelerators or GPUs
> > with resizable BARs enabled, we need to manually align the mmap.
> > 
> > There doesn't seem to be a way for userspace to easily learn about PMD
> > and PUD mapping level sizes, therefore this takes the simple approach
> > to align the mapping to the power-of-two size of the region, up to 1GiB,
> > which is currently the maximum alignment we care about.  
> 
> 
> Couldn't we inspect /sys/kernel/mm/hugepages/ to get the sizes?

Sifting through sysfs doesn't seem like a great solution if we want to
support running QEMU in a sandbox with limited access.  Hugepage sizes
also don't seem to map strictly to PMD and PUD page table levels on all
platforms.  For instance, ARM seems to support an assortment of hugepage
sizes, some of which appear to be available via contiguous page hinting
rather than actual page table level sizes.  At that point we'd still be
approximating the actual discrete mapping intervals, but at a
substantial increase in complexity, unless we already have dependencies
and existing code that can be leveraged.
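For context on the sysfs approach under discussion: the kernel exposes one directory per supported hugetlb size under /sys/kernel/mm/hugepages/, named hugepages-<size>kB.  A rough sketch of turning such a directory name into a byte count is below.  `hugepage_dir_to_bytes()` is a hypothetical helper, not existing QEMU code, and the sizes it yields are hugetlb sizes, which, as noted above, need not correspond to PMD or PUD levels.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: convert a directory name of the form
 * "hugepages-<N>kB" into a size in bytes.  Returns 0 if the name does
 * not match that pattern.  Overflow of very large values is not
 * checked in this sketch.
 */
static uint64_t hugepage_dir_to_bytes(const char *name)
{
    static const char prefix[] = "hugepages-";
    const size_t plen = sizeof(prefix) - 1;
    unsigned long long kib;
    char *end;

    assert(name != NULL);

    if (strncmp(name, prefix, plen) != 0) {
        return 0;
    }
    /* Require a digit so strtoull() cannot accept a sign or whitespace. */
    if (name[plen] < '0' || name[plen] > '9') {
        return 0;
    }

    kib = strtoull(name + plen, &end, 10);
    if (strcmp(end, "kB") != 0) {
        return 0;
    }

    return (uint64_t)kib * 1024;
}
```

A caller would still need to enumerate the directory (for example with opendir() and readdir()), which is exactly the access that may be unavailable inside a sandbox.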

> > Cc: Peter Xu <peterx@redhat.com>
> > Signed-off-by: Alex Williamson <alex.williamson@redhat.com>  
> 
> anyhow,
> 
> 
> Reviewed-by: Cédric Le Goater <clg@redhat.com>

Thanks!

Alex
 

Patch

diff --git a/hw/vfio/helpers.c b/hw/vfio/helpers.c
index b9e606e364a2..913796f437f8 100644
--- a/hw/vfio/helpers.c
+++ b/hw/vfio/helpers.c
@@ -27,6 +27,7 @@ 
 #include "trace.h"
 #include "qapi/error.h"
 #include "qemu/error-report.h"
+#include "qemu/units.h"
 #include "monitor/monitor.h"
 
 /*
@@ -406,8 +407,35 @@  int vfio_region_mmap(VFIORegion *region)
     prot |= region->flags & VFIO_REGION_INFO_FLAG_WRITE ? PROT_WRITE : 0;
 
     for (i = 0; i < region->nr_mmaps; i++) {
-        region->mmaps[i].mmap = mmap(NULL, region->mmaps[i].size, prot,
-                                     MAP_SHARED, region->vbasedev->fd,
+        size_t align = MIN(1ULL << ctz64(region->mmaps[i].size), 1 * GiB);
+        void *map_base, *map_align;
+
+        /*
+         * Align the mmap for more efficient mapping in the kernel.  Ideally
+         * we'd know the PMD and PUD mapping sizes to use as discrete alignment
+         * intervals, but we don't.  As of Linux v6.12, the largest PUD size
+         * supporting huge pfnmap is 1GiB (ARCH_SUPPORTS_PUD_PFNMAP is only set
+         * on x86_64).  Align by power-of-two size, capped at 1GiB.
+         *
+         * NB. qemu_memalign() and friends actually allocate memory, whereas
+         * the region size here can exceed host memory, therefore we manually
+         * create an oversized anonymous mapping and clean it up for alignment.
+         */
+        map_base = mmap(0, region->mmaps[i].size + align, PROT_NONE,
+                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
+        if (map_base == MAP_FAILED) {
+            ret = -errno;
+            goto no_mmap;
+        }
+
+        map_align = (void *)ROUND_UP((uintptr_t)map_base, (uintptr_t)align);
+        munmap(map_base, map_align - map_base);
+        munmap(map_align + region->mmaps[i].size,
+               align - (map_align - map_base));
+
+        region->mmaps[i].mmap = mmap(map_align, region->mmaps[i].size, prot,
+                                     MAP_SHARED | MAP_FIXED,
+                                     region->vbasedev->fd,
                                      region->fd_offset +
                                      region->mmaps[i].offset);
         if (region->mmaps[i].mmap == MAP_FAILED) {
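The reserve-and-trim step in the hunk above can be sketched outside QEMU roughly as follows.  This is a standalone illustration under stated assumptions, not the patch itself: `reserve_aligned()` is a hypothetical name, the alignment is assumed to be a nonzero power of two, and `size + align` is assumed not to overflow.  Unlike the hunk, the sketch skips the leading munmap() when the reservation already happens to be aligned, since Linux rejects a zero-length munmap() with EINVAL (harmless there, but noisy under tracing).

```c
#define _DEFAULT_SOURCE /* for MAP_ANONYMOUS under strict -std= modes */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

/*
 * Hypothetical helper: reserve size bytes of address space whose start
 * is aligned to align.  Returns the aligned address, or NULL on
 * failure.  The range is PROT_NONE and consumes no memory; the caller
 * is expected to map over it with MAP_FIXED.
 */
static void *reserve_aligned(size_t size, size_t align)
{
    uint8_t *base, *aligned;
    size_t head, tail;

    /* align must be a nonzero power of two for the mask below. */
    assert(align != 0 && (align & (align - 1)) == 0);

    /* Over-reserve by align so an aligned start must exist inside. */
    base = mmap(NULL, size + align, PROT_NONE,
                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED) {
        return NULL;
    }

    /* Round the base address up to the next multiple of align. */
    aligned = (uint8_t *)(((uintptr_t)base + align - 1) &
                          ~((uintptr_t)align - 1));

    head = (size_t)(aligned - base); /* unused bytes before the range */
    tail = align - head;             /* unused bytes after it, never 0 */

    if (head != 0) {
        munmap(base, head);
    }
    munmap(aligned + size, tail);

    return aligned;
}
```

The returned address can then be passed as the hint to a MAP_FIXED mmap() of the device region, which atomically replaces the placeholder reservation.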