Message ID | 1681141583-87816-1-git-send-email-steven.sistare@oracle.com
---|---
State | New, archived
Series | util/mmap: optimize qemu_ram_mmap() alignment
On 10.04.23 17:46, Steve Sistare wrote:
> Guest RAM created with memory-backend-memfd is aligned to a
> QEMU_VMALLOC_ALIGN=2M boundary, and memory-backend-memfd does not support
> the "align" parameter to change the default. This is sub-optimal on
> aarch64 kernels with a 64 KB page size and 512 MB huge page size, as the
> range will not be backed by huge pages. Moreover, any shared allocation
> using qemu_ram_mmap() will be sub-optimal on such a system if the align
> parameter is less than 512 MB.
>
> The kernel is smart enough to return a hugely aligned pointer for MAP_SHARED
> mappings when /sys/kernel/mm/transparent_hugepage/shmem_enabled allows it.
> However, the qemu function qemu_ram_mmap() mmap's twice to perform its own
> alignment:
>
>     guardptr = mmap(0, total, PROT_NONE, flags, ...
>     flags |= shared ? MAP_SHARED : MAP_PRIVATE;
>     ptr = mmap(guardptr + offset, size, prot, flags | map_sync_flags, ...
>
> On the first call, flags has MAP_PRIVATE, hence the kernel does not apply
> its shared memory policy, and returns a non-huge-aligned guardptr.
>
> To fix, for shared mappings, pass MAP_SHARED to both mmap calls.
>
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> Reviewed-by: Joao Martins <joao.m.martins@oracle.com>
> ---
>  util/mmap-alloc.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/util/mmap-alloc.c b/util/mmap-alloc.c
> index 5ed7d29..37a0d1e 100644
> --- a/util/mmap-alloc.c
> +++ b/util/mmap-alloc.c
> @@ -121,7 +121,7 @@ static bool map_noreserve_effective(int fd, uint32_t qemu_map_flags)
>   * Reserve a new memory region of the requested size to be used for mapping
>   * from the given fd (if any).
>   */
> -static void *mmap_reserve(size_t size, int fd)
> +static void *mmap_reserve(size_t size, int fd, int final_flags)
>  {
>      int flags = MAP_PRIVATE;
>
> @@ -144,6 +144,7 @@ static void *mmap_reserve(size_t size, int fd)
>  #else
>      fd = -1;
>      flags |= MAP_ANONYMOUS;
> +    flags |= final_flags & MAP_SHARED;

Setting both MAP_PRIVATE and MAP_SHARED sure sounds very wrong.

The traditional way to handle that is via QEMU_VMALLOC_ALIGN.

I think you'd actually want to turn QEMU_VMALLOC_ALIGN into a function that
is able to special-case like
hw/virtio/virtio-mem.c:virtio_mem_default_thp_size() does for
"qemu_real_host_page_size() == 64 * KiB".

If you set QEMU_VMALLOC_ALIGN properly, you'll also have any DIMMs get that
alignment in guest physical address space.
On 4/11/2023 3:57 AM, David Hildenbrand wrote:
> On 10.04.23 17:46, Steve Sistare wrote:
>> Guest RAM created with memory-backend-memfd is aligned to a
>> QEMU_VMALLOC_ALIGN=2M boundary, and memory-backend-memfd does not support
>> the "align" parameter to change the default. This is sub-optimal on
>> aarch64 kernels with a 64 KB page size and 512 MB huge page size, as the
>> range will not be backed by huge pages. Moreover, any shared allocation
>> using qemu_ram_mmap() will be sub-optimal on such a system if the align
>> parameter is less than 512 MB.
>>
>> The kernel is smart enough to return a hugely aligned pointer for MAP_SHARED
>> mappings when /sys/kernel/mm/transparent_hugepage/shmem_enabled allows it.
>> However, the qemu function qemu_ram_mmap() mmap's twice to perform its own
>> alignment:
>>
>>     guardptr = mmap(0, total, PROT_NONE, flags, ...
>>     flags |= shared ? MAP_SHARED : MAP_PRIVATE;
>>     ptr = mmap(guardptr + offset, size, prot, flags | map_sync_flags, ...
>>
>> On the first call, flags has MAP_PRIVATE, hence the kernel does not apply
>> its shared memory policy, and returns a non-huge-aligned guardptr.
>>
>> To fix, for shared mappings, pass MAP_SHARED to both mmap calls.
>>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>> Reviewed-by: Joao Martins <joao.m.martins@oracle.com>
>> ---
>>  util/mmap-alloc.c | 5 +++--
>>  1 file changed, 3 insertions(+), 2 deletions(-)
>>
>> diff --git a/util/mmap-alloc.c b/util/mmap-alloc.c
>> index 5ed7d29..37a0d1e 100644
>> --- a/util/mmap-alloc.c
>> +++ b/util/mmap-alloc.c
>> @@ -121,7 +121,7 @@ static bool map_noreserve_effective(int fd, uint32_t qemu_map_flags)
>>   * Reserve a new memory region of the requested size to be used for mapping
>>   * from the given fd (if any).
>>   */
>> -static void *mmap_reserve(size_t size, int fd)
>> +static void *mmap_reserve(size_t size, int fd, int final_flags)
>>  {
>>      int flags = MAP_PRIVATE;
>>
>> @@ -144,6 +144,7 @@ static void *mmap_reserve(size_t size, int fd)
>>  #else
>>      fd = -1;
>>      flags |= MAP_ANONYMOUS;
>> +    flags |= final_flags & MAP_SHARED;
>
> Setting both MAP_PRIVATE and MAP_SHARED sure sounds very wrong.

Yes, thanks. I introduced that mistake when I ported the fix from an earlier
qemu that did not have mmap_reserve. Should be:

    fd = -1;
    flags = MAP_ANONYMOUS;
    flags |= final_flags & (MAP_SHARED | MAP_PRIVATE);

> The traditional way to handle that is via QEMU_VMALLOC_ALIGN.
>
> I think you'd actually want to turn QEMU_VMALLOC_ALIGN into a function that
> is able to special-case like
> hw/virtio/virtio-mem.c:virtio_mem_default_thp_size() does for
> "qemu_real_host_page_size() == 64 * KiB".
>
> If you set QEMU_VMALLOC_ALIGN properly, you'll also have any DIMMs get that
> alignment in guest physical address space.

If we increase QEMU_VMALLOC_ALIGN, then all allocations will be 512MB
aligned. If we make many small allocations, we could conceivably run out of
VA space. Further, the processor may use low order address bits in internal
hashes, and now offset X in every allocated range will have the same value
for the low 29 bits, possibly causing more collisions and reducing
performance. Further, the huge alignment is not even needed if huge pages
for shmem have been disabled in sysconfig.

We could avoid that by adding logic to also consider allocation size when
choosing alignment, and checking sysconfig tunables. Or, we can just let the
kernel do so by telling it the truth about memory flavor when calling mmap.

- Steve
On 11.04.23 16:39, Steven Sistare wrote:
> On 4/11/2023 3:57 AM, David Hildenbrand wrote:
>> On 10.04.23 17:46, Steve Sistare wrote:
>>> Guest RAM created with memory-backend-memfd is aligned to a
>>> QEMU_VMALLOC_ALIGN=2M boundary, and memory-backend-memfd does not support
>>> the "align" parameter to change the default. This is sub-optimal on
>>> aarch64 kernels with a 64 KB page size and 512 MB huge page size, as the
>>> range will not be backed by huge pages. Moreover, any shared allocation
>>> using qemu_ram_mmap() will be sub-optimal on such a system if the align
>>> parameter is less than 512 MB.
>>>
>>> The kernel is smart enough to return a hugely aligned pointer for MAP_SHARED
>>> mappings when /sys/kernel/mm/transparent_hugepage/shmem_enabled allows it.
>>> However, the qemu function qemu_ram_mmap() mmap's twice to perform its own
>>> alignment:
>>>
>>>     guardptr = mmap(0, total, PROT_NONE, flags, ...
>>>     flags |= shared ? MAP_SHARED : MAP_PRIVATE;
>>>     ptr = mmap(guardptr + offset, size, prot, flags | map_sync_flags, ...
>>>
>>> On the first call, flags has MAP_PRIVATE, hence the kernel does not apply
>>> its shared memory policy, and returns a non-huge-aligned guardptr.
>>>
>>> To fix, for shared mappings, pass MAP_SHARED to both mmap calls.
>>>
>>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>>> Reviewed-by: Joao Martins <joao.m.martins@oracle.com>
>>> ---
>>>  util/mmap-alloc.c | 5 +++--
>>>  1 file changed, 3 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/util/mmap-alloc.c b/util/mmap-alloc.c
>>> index 5ed7d29..37a0d1e 100644
>>> --- a/util/mmap-alloc.c
>>> +++ b/util/mmap-alloc.c
>>> @@ -121,7 +121,7 @@ static bool map_noreserve_effective(int fd, uint32_t qemu_map_flags)
>>>   * Reserve a new memory region of the requested size to be used for mapping
>>>   * from the given fd (if any).
>>>   */
>>> -static void *mmap_reserve(size_t size, int fd)
>>> +static void *mmap_reserve(size_t size, int fd, int final_flags)
>>>  {
>>>      int flags = MAP_PRIVATE;
>>>
>>> @@ -144,6 +144,7 @@ static void *mmap_reserve(size_t size, int fd)
>>>  #else
>>>      fd = -1;
>>>      flags |= MAP_ANONYMOUS;
>>> +    flags |= final_flags & MAP_SHARED;
>>
>> Setting both MAP_PRIVATE and MAP_SHARED sure sounds very wrong.
>
> Yes, thanks. I introduced that mistake when I ported the fix from an earlier
> qemu that did not have mmap_reserve. Should be:
>
>     fd = -1;
>     flags = MAP_ANONYMOUS;
>     flags |= final_flags & (MAP_SHARED | MAP_PRIVATE);

Better, but I still don't like bringing in some Linux kernel MAP_SHARED
logic that apparently does better in some case right now ... because this is
supposed to work on POSIX and ideally optimizes on various
systems+configurations.

>> The traditional way to handle that is via QEMU_VMALLOC_ALIGN.
>>
>> I think you'd actually want to turn QEMU_VMALLOC_ALIGN into a function that
>> is able to special-case like
>> hw/virtio/virtio-mem.c:virtio_mem_default_thp_size() does for
>> "qemu_real_host_page_size() == 64 * KiB".
>>
>> If you set QEMU_VMALLOC_ALIGN properly, you'll also have any DIMMs get that
>> alignment in guest physical address space.
>
> If we increase QEMU_VMALLOC_ALIGN, then all allocations will be 512MB
> aligned. If we make many small allocations, we could conceivably run out of
> VA space. Further, the processor may use low order

Running out of VA space? I'm not so sure. We are talking about the
qemu_ram_mmap() function here ... (well, and file_ram_alloc() which only
uses QEMU_VMALLOC_ALIGN on s390x). This is all about guest RAM, although the
name suggests otherwise.

> address bits in internal hashes, and now offset X in every allocated range
> will have the same value for the low 29 bits, possibly causing more
> collisions and reducing performance. Further, the huge alignment is not
> even needed if huge pages for shmem have been disabled in sysconfig.

I'm not convinced this is worth optimizing, or even special-casing just for
arm64 when we're only talking about mapping guest RAM. Besides, what about
ordinary anon THP? See below.

>
> We could avoid that by adding logic to also consider allocation size when
> choosing alignment, and checking sysconfig tunables. Or, we can just let
> the kernel do so by telling it the truth about memory flavor when calling
> mmap.

It's probably best to not trust the kernel to do the right alignment thing,
because we learned that it's not easy to get such optimizations into the
kernel:

https://lkml.kernel.org/r/20220809142457.4751229f@imladris.surriel.com

got reverted again, which would have done the right thing for anon THP.

I'd suggest that we either hard-code it for arm64 as well, by adjusting
QEMU_VMALLOC_ALIGN (if we don't care about giving it a better name and
making it a function):

    #elif defined(__linux__) && defined(__aarch64__)
    #  define QEMU_VMALLOC_ALIGN \
        ((qemu_real_host_page_size() == 64 * KiB) ? 512 * MiB : 2 * MiB)
    #elif ...

Alternatively, we could rewrite it into a proper function that caches the
result and tries looking up the actual host THP size using
"/sys/kernel/mm/transparent_hugepage/hpage_pmd_size" under Linux before
falling back to the known defaults.

Sure, we could optimize in the caller (allocation size smaller than
alignment?), but that requires good justification.
diff --git a/util/mmap-alloc.c b/util/mmap-alloc.c
index 5ed7d29..37a0d1e 100644
--- a/util/mmap-alloc.c
+++ b/util/mmap-alloc.c
@@ -121,7 +121,7 @@ static bool map_noreserve_effective(int fd, uint32_t qemu_map_flags)
  * Reserve a new memory region of the requested size to be used for mapping
  * from the given fd (if any).
  */
-static void *mmap_reserve(size_t size, int fd)
+static void *mmap_reserve(size_t size, int fd, int final_flags)
 {
     int flags = MAP_PRIVATE;
 
@@ -144,6 +144,7 @@ static void *mmap_reserve(size_t size, int fd)
 #else
     fd = -1;
     flags |= MAP_ANONYMOUS;
+    flags |= final_flags & MAP_SHARED;
 #endif
 
     return mmap(0, size, PROT_NONE, flags, fd, 0);
@@ -232,7 +233,7 @@ void *qemu_ram_mmap(int fd,
      */
     total = size + align;
 
-    guardptr = mmap_reserve(total, fd);
+    guardptr = mmap_reserve(total, fd, qemu_map_flags);
     if (guardptr == MAP_FAILED) {
         return MAP_FAILED;
     }