diff mbox series

hvf: arm: Allow creating VMs with > 63GB of RAM on macOS 15+

Message ID 20240718230031.69641-1-danny_canter@apple.com (mailing list archive)
State New, archived
Series hvf: arm: Allow creating VMs with > 63GB of RAM on macOS 15+

Commit Message

Danny Canter July 18, 2024, 11 p.m. UTC
This patch's main focus is to enable creating VMs with > 63GB
of RAM on Apple Silicon machines by using some new HVF APIs. In
pursuit of this a couple of things related to how we handle the
physical address range we expose to guests were altered:

The default IPA size on all Apple Silicon machines for HVF is
currently 36 bits. This bars making a VM with > 63GB of RAM
(as RAM starts at 1GB in the memory map). Currently, to get the
IPA size we were reading id_aa64mmfr0_el1's PARange field
from a newly made vcpu. Unfortunately HVF just returns the
host's PARange directly for the initial value, not the IPA
size that will actually back the VM, so we end up believing we
have much more address space than we actually do.
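For context, PARange is the 4-bit field at bits [3:0] of ID_AA64MMFR0_EL1, and each encoding maps to a fixed physical address size. A standalone decoder sketch (not part of the patch; newer encodings beyond 52 bits omitted) looks like:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Decode ID_AA64MMFR0_EL1.PARange (bits [3:0]) into a physical address
 * size in bits: 0b0000=32, 0b0001=36, 0b0010=40, 0b0011=42, 0b0100=44,
 * 0b0101=48, 0b0110=52. Returns 0 for encodings not handled here.
 */
static unsigned parange_to_bits(uint64_t id_aa64mmfr0)
{
    static const unsigned map[] = { 32, 36, 40, 42, 44, 48, 52 };
    unsigned field = id_aa64mmfr0 & 0xf;

    return field < sizeof(map) / sizeof(map[0]) ? map[field] : 0;
}
```

So a host register reporting, say, PARange = 0b0010 advertises a 40-bit (1TB) physical address space, regardless of the 36-bit IPA space HVF actually gives the VM.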

Starting in macOS 13.0, APIs were introduced to query the maximum
IPA size the kernel supports and to set the IPA size for a given
VM. However, this still has a couple of issues on < macOS 15. Up
until macOS 15 (and if the hardware supported it), the max IPA
size was 39 bits, which is not a valid PARange value, so we
couldn't clamp what we advertise in the vcpu's id_aa64mmfr0_el1
down to our IPA size. Starting in macOS 15, however, the maximum
IPA size is 40 bits (again, if the hardware supports it), which
is a valid PARange value, so we can both set our IPA size to the
maximum and clamp the PARange we advertise to the guest. This
allows VMs with more than 63GB of RAM and should fix the oddness
of the PARange situation as well.

For the implementation, I've decided to only bump the IPA size
if the amount of RAM requested encroaches on the default IPA size
of 36 bits, since at 40 bits of IPA space we need one extra level
of stage2 page tables.
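That heuristic can be sketched standalone as follows (the 1GB RAM base is this board's memory-map convention, and the function name here is illustrative, not the patch's):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define GiB (1ULL << 30)

/*
 * Decide whether a VM's requested RAM encroaches on the default 36-bit
 * IPA space, given that RAM starts at 1GB in the memory map. With a
 * 36-bit IPA space (64GB), RAM of 63GB or more runs off the end, so the
 * IPA size must be bumped -- at the cost of one extra level of stage2
 * page tables once the IPA size reaches 40 bits.
 */
static bool needs_larger_ipa(uint64_t ram_size, uint32_t default_ipa_bits)
{
    uint64_t threshold = (1ULL << default_ipa_bits) - 1 * GiB;

    return ram_size >= threshold;
}
```

With the 36-bit default, the threshold works out to 63GB: a 32GB VM keeps the cheaper 36-bit configuration, while a 63GB VM triggers the larger IPA size.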

Signed-off-by: Danny Canter <danny_canter@apple.com>
Reviewed-by: Cameron Esfahani <dirty@apple.com>
---
 accel/hvf/hvf-accel-ops.c |   6 +-
 include/sysemu/hvf_int.h  |   1 +
 target/arm/hvf/hvf.c      | 113 ++++++++++++++++++++++++++++++++++++++
 target/arm/internals.h    |  19 +++++++
 target/arm/ptw.c          |  20 +++++++
 target/i386/hvf/hvf.c     |   5 ++
 6 files changed, 159 insertions(+), 5 deletions(-)

Comments

Peter Maydell July 29, 2024, 4:27 p.m. UTC | #1
On Fri, 19 Jul 2024 at 00:03, Danny Canter <danny_canter@apple.com> wrote:
>
> This patch's main focus is to enable creating VMs with > 63GB
> of RAM on Apple Silicon machines by using some new HVF APIs. In
> pursuit of this a couple of things related to how we handle the
> physical address range we expose to guests were altered:

Hi -- this is just a note to say that this patch is on my
todo list to review, I just haven't got to it yet. (Unfortunately
it's just missed the cutoff for the upcoming 9.1 release, and
I've been prioritising the for-9.1 stuff.)

I did see one thing from an initial quick eyeball, but
don't bother respinning the patchset just to change that:
I'll do an actual review hopefully in the next week or so.

> +static uint32_t hvf_get_default_ipa_bit_size(void)
> +{
> +    uint32_t default_ipa_size = 36;
> +#if defined(MAC_OS_VERSION_13_0) && \
> +    MAC_OS_X_VERSION_MIN_REQUIRED >= MAC_OS_VERSION_13_0
> +    hv_return_t ret = hv_vm_config_get_default_ipa_size(&default_ipa_size);
> +    assert_hvf_ok(ret);
> +#endif

You can assume we have at least macos 13 or better -- our
minimum supported build platform is at least that new,
and we already dropped some 10.12-and-earlier compat
ifdefs (see commit 2d27c91e2b72ac7).

> +    return default_ipa_size;
> +}

thanks
-- PMM
Danny Canter July 30, 2024, 9:53 p.m. UTC | #2
Thanks Peter! 

For the macOS 13 comment, just so I'm clear: you're saying the minimum we support is
now 13, so the conditional compilation for those APIs isn't required anymore? I suppose
that tracks given the wording that we support the last two macOS releases at any given
time; that kind of slipped my mind when thinking about what actually needs to be ifdef'd
here. That certainly cleans things up a bit for some of the functions in this patch.

— Danny 

> On Jul 29, 2024, at 9:27 AM, Peter Maydell <peter.maydell@linaro.org> wrote:
> 
> On Fri, 19 Jul 2024 at 00:03, Danny Canter <danny_canter@apple.com> wrote:
>> 
>> This patch's main focus is to enable creating VMs with > 63GB
>> of RAM on Apple Silicon machines by using some new HVF APIs. In
>> pursuit of this a couple of things related to how we handle the
>> physical address range we expose to guests were altered:
> 
> Hi -- this is just a note to say that this patch is on my
> todo list to review, I just haven't got to it yet. (Unfortunately
> it's just missed the cutoff for the upcoming 9.1 release, and
> I've been prioritising the for-9.1 stuff.)
> 
> I did see one thing from an initial quick eyeball, but
> don't bother respinning the patchset just to change that:
> I'll do an actual review hopefully in the next week or so.
> 
>> +static uint32_t hvf_get_default_ipa_bit_size(void)
>> +{
>> +    uint32_t default_ipa_size = 36;
>> +#if defined(MAC_OS_VERSION_13_0) && \
>> +    MAC_OS_X_VERSION_MIN_REQUIRED >= MAC_OS_VERSION_13_0
>> +    hv_return_t ret = hv_vm_config_get_default_ipa_size(&default_ipa_size);
>> +    assert_hvf_ok(ret);
>> +#endif
> 
> You can assume we have at least macos 13 or better -- our
> minimum supported build platform is at least that new,
> and we already dropped some 10.12-and-earlier compat
> ifdefs (see commit 2d27c91e2b72ac7).
> 
>> +    return default_ipa_size;
>> +}
> 
> thanks
> -- PMM
Peter Maydell July 31, 2024, 8:26 a.m. UTC | #3
On Tue, 30 Jul 2024 at 22:53, Danny Canter <danny_canter@apple.com> wrote:
>
> Thanks Peter!
>
> For the macOS 13 comment just so I’m clear, you’re saying the minimum we support is
> 13 now so the conditional compilation for those isn’t required anymore as well? I suppose
> that tracks given the wording that we support the last two macOS releases at any given
> time, that kind of slipped my mind when thinking about what actually needs to be ifdef’d
> here. That certainly cleans things up a bit for some of the functions in this patch.

Yep, that's the idea. We tend to leave the conditionals in a bit beyond
literally just the last two releases, so there's a period of "not supported,
but it probably still works" for a release that's fallen off the end of our
support list. But in this case, since we already have code in QEMU that
requires 13 to even compile, there's no need to add new 13-or-better ifdefs
in this new code.

-- PMM
Peter Maydell Aug. 12, 2024, 2:52 p.m. UTC | #4
On Fri, 19 Jul 2024 at 00:03, Danny Canter <danny_canter@apple.com> wrote:
>
> This patch's main focus is to enable creating VMs with > 63GB
> of RAM on Apple Silicon machines by using some new HVF APIs. In
> pursuit of this a couple of things related to how we handle the
> physical address range we expose to guests were altered:
>
> The default IPA size on all Apple Silicon machines for HVF is
> currently 36 bits. This bars making a VM with > 63GB (as RAM
> starts at 1GB in the memory map) of RAM. Currently, to get the
> IPA size we were reading id_aa64mmfr0_el1's PARange field
> from a newly made vcpu. Unfortunately HVF just returns the
> hosts PARange directly for the initial value and not the IPA
> size that will actually back the VM, so we believe we have much
> more address space than we actually do today it seems.

So just to check my understanding, this means that with current
QEMU, on all Apple hardware, attempting to create a VM with
more than 63 GB of RAM will always fail in the same way,
regardless of whether that CPU's hardware has a 36 bit IPA
or a larger IPA? That is, we don't change the default IPA for the
VM, so it's 36 bits, and then the hvf command to map in the RAM
to the guest address space fails with HV_BAD_ARGUMENT, per
https://gitlab.com/qemu-project/qemu/-/issues/1816 .

> Starting in macOS 13.0 some APIs were introduced to be able to
> query the maximum IPA size the kernel supports, and to set the IPA
> size for a given VM. However, this still has a couple of issues
> on < macOS 15. Up until macOS 15 (and if the hardware supported
> it) the max IPA size was 39 bits which is not a valid PARange
> value, so we can't clamp down what we advertise in the vcpu's
> id_aa64mmfr0_el1 to our IPA size. Starting in macOS 15 however,
> the maximum IPA size is 40 bits (if it's supported in the hardware
> as well) which is also a valid PARange value, so we can set our IPA
> size to the maximum as well as clamp down the PARange we advertise
> to the guest. This allows VMs with 63+ GB of ram and should fix the
> oddness of the PARange situation as well.

So (again to clarify that I understand what's happening here)
for macos 13-14 we'll effectively continue to use a 36-bit
IPA range because we clamp the "39" value down to the next
lowest actually-valid value of 36 ? And so if you want >63GB
of memory you'll need all of:
 * a host CPU which supports at least a 40 bit IPA
   (is there a definition somewhere of which these are?)
 * macos 15
 * a QEMU with these changes

?

(That seems fine to me: I'm happy to say "get macos 15 if you
want this" rather than trying to cope with the non-standard
39 bit IPA in QEMU. We should make sure the error message in
the IPA-too-small case is comprehensible -- I think at the
moment we somewhat unhelpfully assert()...)

> For the implementation of this I've decided to only bump the IPA
> size if the amount of RAM requested is encroaching on the default IPA
> size of 36 bits, as at 40 bits of IPA space we have to have one extra
> level of stage2 page tables.
>
> Signed-off-by: Danny Canter <danny_canter@apple.com>
> Reviewed-by: Cameron Esfahani <dirty@apple.com>

> @@ -929,6 +977,66 @@ void hvf_arch_vcpu_destroy(CPUState *cpu)
>  {
>  }
>
> +hv_return_t hvf_arch_vm_create(MachineState *ms)
> +{
> +    uint32_t default_ipa_size = hvf_get_default_ipa_bit_size();
> +    uint32_t max_ipa_size = hvf_get_max_ipa_bit_size();
> +    hv_return_t ret;
> +
> +    chosen_ipa_bit_size = default_ipa_size;
> +
> +    /*
> +     * Set the IPA size for the VM:
> +     *
> +     * Starting from macOS 13 a new set of APIs were introduced that allow you
> +     * to query for the maximum IPA size that is supported on your system. macOS
> +     * 13 and 14's kernel both return a value less than 40 bits (typically 39,
> +     * but depends on hardware), however starting in macOS 15 and up the IPA
> +     * size supported (in the kernel at least) is up to 40 bits. A common scheme
> +     * for attempting to get the IPA size prior to the introduction of these new
> +     * APIs was to read ID_AA64MMFR0.PARange from a vcpu in the hopes that HVF
> +     * was returning the maximum IPA size in that. However, this was not the
> +     * case. HVF would return the host's PARange value directly which is
> +     * generally larger than 40 bits.
> +     *
> +     * Using that value we could set up our memory map with regions much outside
> +     * the actually supported IPA size, and also advertise a much larger
> +     * physical address space to the guest. On the hardware+OS combos where
> +     * the IPA size is less than 40 bits, but greater than 36, we also don't
> +     * have a valid PARange value to round down to before 36 bits which is
> +     * already the default.
> +     *
> +     * With that in mind, before we make the VM lets grab the maximum supported
> +     * IPA size and clamp it down to the first valid PARange value so we can
> +     * advertise the correct address size for the guest later on. Then if it's
> +     * >= 40 set this as the IPA size for the VM using the new APIs. There's a
> +     * small heuristic for actually altering the IPA size for the VM which is
> +     * if our requested RAM is encroaching on the top of our default IPA size.
> +     * This is just an optimization, as at 40 bits we need to create one more
> +     * level of stage2 page tables.
> +     */
> +#if defined(MAC_OS_VERSION_13_0) && \
> +    MAC_OS_X_VERSION_MIN_REQUIRED >= MAC_OS_VERSION_13_0
> +    hv_vm_config_t config = hv_vm_config_create();
> +
> +    /* In our memory map RAM starts at 1GB. */

This is not board-specific code, so you can't assume that.
The board gets to pick the memory map and where RAM starts in it.

You probably need to do something similar to what we do
in hw/arm/virt.c:virt_kvm_type() where we find out what
the best IPA the hypervisor supports is, set the board memory
map to respect that, diagnose an error if the user asked for
more RAM than fits into that IPA range, and then arrange for
the actual VM/vcpu creation to be done with the required IPA.

This is unfortunately probably going to imply a bit of extra
plumbing to be implemented for hvf -- that MachineClass::kvm_type
method is (as the name suggests) KVM specific. (Multi-patch
patchset for that, where we add the plumbing in as its own
separate patch (and/or whatever other split of functionality
into coherent chunks makes sense), rather than one-big-patch, please.)

> +    uint64_t threshold = (1ull << default_ipa_size) - (1 * GiB);
> +    if (ms->ram_size >= threshold && max_ipa_size >= FIRST_HIGHMEM_PARANGE) {
> +        ret = hv_vm_config_set_ipa_size(config, max_ipa_size);
> +        assert_hvf_ok(ret);
> +
> +        chosen_ipa_bit_size = max_ipa_size;
> +    }
> +
> +    ret = hv_vm_create(config);
> +    os_release(config);
> +#else
> +    ret = hv_vm_create(NULL);
> +#endif
> +
> +    return ret;
> +}

> +uint8_t round_down_to_parange_index(uint8_t bit_size)
> +{
> +    for (int i = ARRAY_SIZE(pamax_map) - 1; i >= 0; i--) {
> +        if (pamax_map[i] <= bit_size) {
> +            return i;
> +        }
> +    }
> +    g_assert_not_reached();
> +}
> +
> +uint8_t round_down_to_parange_bit_size(uint8_t bit_size)
> +{
> +    for (int i = ARRAY_SIZE(pamax_map) - 1; i >= 0; i--) {
> +        if (pamax_map[i] <= bit_size) {
> +            return pamax_map[i];
> +        }
> +    }
> +    g_assert_not_reached();

We could implement this as
       return pamax_map[round_down_to_parange_index(bit_size)];

and avoid having to code the loop twice, right?
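Applied to the pamax_map this code uses (QEMU's table of valid PARange sizes, {32, 36, 40, 42, 44, 48, 52}), the deduplicated version might look like this standalone sketch (g_assert_not_reached() swapped for a plain assert):

```c
#include <assert.h>
#include <stdint.h>

/* Valid Arm PARange physical address sizes, in ascending order. */
static const uint8_t pamax_map[] = { 32, 36, 40, 42, 44, 48, 52 };

#define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

/* Index of the largest PARange entry that does not exceed bit_size. */
static uint8_t round_down_to_parange_index(uint8_t bit_size)
{
    for (int i = ARRAY_SIZE(pamax_map) - 1; i >= 0; i--) {
        if (pamax_map[i] <= bit_size) {
            return i;
        }
    }
    assert(0 && "bit_size below the smallest PARange");
    return 0; /* not reached */
}

/* The same rounding, expressed in bits, reusing the index helper. */
static uint8_t round_down_to_parange_bit_size(uint8_t bit_size)
{
    return pamax_map[round_down_to_parange_index(bit_size)];
}
```

This is how macOS 13/14's 39-bit maximum ends up treated as 36 bits: 39 rounds down to the nearest valid PARange value below it.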

thanks
-- PMM
Danny Canter Aug. 12, 2024, 10:18 p.m. UTC | #5
Peter, thanks for the review! I'll work on splitting this up a bit to support the plumbing you mentioned KVM does today on ARM.

> On Aug 12, 2024, at 10:52 AM, Peter Maydell <peter.maydell@linaro.org> wrote:
> 
> On Fri, 19 Jul 2024 at 00:03, Danny Canter <danny_canter@apple.com> wrote:
>> 
>> This patch's main focus is to enable creating VMs with > 63GB
>> of RAM on Apple Silicon machines by using some new HVF APIs. In
>> pursuit of this a couple of things related to how we handle the
>> physical address range we expose to guests were altered:
>> 
>> The default IPA size on all Apple Silicon machines for HVF is
>> currently 36 bits. This bars making a VM with > 63GB (as RAM
>> starts at 1GB in the memory map) of RAM. Currently, to get the
>> IPA size we were reading id_aa64mmfr0_el1's PARange field
>> from a newly made vcpu. Unfortunately HVF just returns the
>> hosts PARange directly for the initial value and not the IPA
>> size that will actually back the VM, so we believe we have much
>> more address space than we actually do today it seems.
> 
> So just to check my understanding, this means that with current
> QEMU, on all Apple hardware, attempting to create a VM with
> more than 63 GB of RAM will always fail in the same way,
> regardless of whether that CPU's hardware has a 36 bit IPA
> or a larger IPA? That is, we don't change the default IPA for the
> VM, so it's 36 bits, and then the hvf command to map in the RAM
> to the guest address space fails with HV_BAD_ARGUMENT, per
> https://gitlab.com/qemu-project/qemu/-/issues/1816 .

Spot on, yes. We always default to a lower (36-bit) IPA space, and macOS 13 introduced the knobs to raise this on a per-VM
basis. We aren't raising it today, so we'd always fail when the kernel gets an hv_vm_map call with an IPA past the end of our
address space.

> 
>> Starting in macOS 13.0 some APIs were introduced to be able to
>> query the maximum IPA size the kernel supports, and to set the IPA
>> size for a given VM. However, this still has a couple of issues
>> on < macOS 15. Up until macOS 15 (and if the hardware supported
>> it) the max IPA size was 39 bits which is not a valid PARange
>> value, so we can't clamp down what we advertise in the vcpu's
>> id_aa64mmfr0_el1 to our IPA size. Starting in macOS 15 however,
>> the maximum IPA size is 40 bits (if it's supported in the hardware
>> as well) which is also a valid PARange value, so we can set our IPA
>> size to the maximum as well as clamp down the PARange we advertise
>> to the guest. This allows VMs with 63+ GB of ram and should fix the
>> oddness of the PARange situation as well.
> 
> So (again to clarify that I understand what's happening here)
> for macos 13-14 we'll effectively continue to use a 36-bit
> IPA range because we clamp the "39" value down to the next
> lowest actually-valid value of 36 ? And so if you want >63GB
> of memory you'll need all of:
> * a host CPU which supports at least a 40 bit IPA
>   (is there a definition somewhere of which these are?)
> * macos 15
> * a QEMU with these changes
> 
> ?
> 
> (That seems fine to me: I'm happy to say "get macos 15 if you
> want this" rather than trying to cope with the non-standard
> 39 bit IPA in QEMU. We should make sure the error message in
> the IPA-too-small case is comprehensible -- I think at the
> moment we somewhat unhelpfully assert()...)
> 

Spot on again. We didn't want to advertise a larger PA range to the guest than what is actually backing the VM, so for a "correct" world
macOS 15 would be required. You'd get 40 bits of IPA space, and we finally line up with a valid ARM PARange value. As for whether
there's a list of which SoCs support which IPA size, I don't believe so. All of the Pro/Max SoCs do, IIRC, but I'd just direct folks to write a
program that calls `hv_vm_config_get_max_ipa_size` to confirm. There's `sysctl -a kern.hv`, but I'd avoid recommending this in case
the hv API does some extra munging of the value it reports. That hv_vm_config_ API should truly be the source of truth on any given
machine.

As for the error message, I'd have to check what we report today, but it seemed somewhat reasonable after this patch,
if I recall. It was in a codepath that already existed (added for KVM, it seems) that checks whether any part of the memory map exceeds the
maximum IPA size.

>> For the implementation of this I've decided to only bump the IPA
>> size if the amount of RAM requested is encroaching on the default IPA
>> size of 36 bits, as at 40 bits of IPA space we have to have one extra
>> level of stage2 page tables.
>> 
>> Signed-off-by: Danny Canter <danny_canter@apple.com>
>> Reviewed-by: Cameron Esfahani <dirty@apple.com>
> 
>> @@ -929,6 +977,66 @@ void hvf_arch_vcpu_destroy(CPUState *cpu)
>> {
>> }
>> 
>> +hv_return_t hvf_arch_vm_create(MachineState *ms)
>> +{
>> +    uint32_t default_ipa_size = hvf_get_default_ipa_bit_size();
>> +    uint32_t max_ipa_size = hvf_get_max_ipa_bit_size();
>> +    hv_return_t ret;
>> +
>> +    chosen_ipa_bit_size = default_ipa_size;
>> +
>> +    /*
>> +     * Set the IPA size for the VM:
>> +     *
>> +     * Starting from macOS 13 a new set of APIs were introduced that allow you
>> +     * to query for the maximum IPA size that is supported on your system. macOS
>> +     * 13 and 14's kernel both return a value less than 40 bits (typically 39,
>> +     * but depends on hardware), however starting in macOS 15 and up the IPA
>> +     * size supported (in the kernel at least) is up to 40 bits. A common scheme
>> +     * for attempting to get the IPA size prior to the introduction of these new
>> +     * APIs was to read ID_AA64MMFR0.PARange from a vcpu in the hopes that HVF
>> +     * was returning the maximum IPA size in that. However, this was not the
>> +     * case. HVF would return the host's PARange value directly which is
>> +     * generally larger than 40 bits.
>> +     *
>> +     * Using that value we could set up our memory map with regions much outside
>> +     * the actually supported IPA size, and also advertise a much larger
>> +     * physical address space to the guest. On the hardware+OS combos where
>> +     * the IPA size is less than 40 bits, but greater than 36, we also don't
>> +     * have a valid PARange value to round down to before 36 bits which is
>> +     * already the default.
>> +     *
>> +     * With that in mind, before we make the VM lets grab the maximum supported
>> +     * IPA size and clamp it down to the first valid PARange value so we can
>> +     * advertise the correct address size for the guest later on. Then if it's
>> +     * >= 40 set this as the IPA size for the VM using the new APIs. There's a
>> +     * small heuristic for actually altering the IPA size for the VM which is
>> +     * if our requested RAM is encroaching on the top of our default IPA size.
>> +     * This is just an optimization, as at 40 bits we need to create one more
>> +     * level of stage2 page tables.
>> +     */
>> +#if defined(MAC_OS_VERSION_13_0) && \
>> +    MAC_OS_X_VERSION_MIN_REQUIRED >= MAC_OS_VERSION_13_0
>> +    hv_vm_config_t config = hv_vm_config_create();
>> +
>> +    /* In our memory map RAM starts at 1GB. */
> 
> This is not board-specific code, so you can't assume that.
> The board gets to pick the memory map and where RAM starts in it.
> 
> You probably need to do something similar to what we do
> in hw/arm/virt.c:virt_kvm_type() where we find out what
> the best IPA the hypervisor supports is, set the board memory
> map to respect that, diagnose an error if the user asked for
> more RAM than fits into that IPA range, and then arrange for
> the actual VM/vcpu creation to be done with the required IPA.
> 
> This is unfortunately probably going to imply a bit of extra
> plumbing to be implemented for hvf -- that MachineClass::kvm_type
> method is (as the name suggests) KVM specific. (Multi-patch
> patchset for that, where we add the plumbing in as its own
> separate patch (and/or whatever other split of functionality
> into coherent chunks makes sense), rather than one-big-patch, please.)

That’s perfectly fine; I’ll see how the plumbing was done for KVM and emulate it where it makes sense
for HVF. Agreed, though, that’d definitely push this into multi-patch territory. Curious whether you think what’s here today
should be multiple patches, or whether the current work is fine as one?

> 
>> +    uint64_t threshold = (1ull << default_ipa_size) - (1 * GiB);
>> +    if (ms->ram_size >= threshold && max_ipa_size >= FIRST_HIGHMEM_PARANGE) {
>> +        ret = hv_vm_config_set_ipa_size(config, max_ipa_size);
>> +        assert_hvf_ok(ret);
>> +
>> +        chosen_ipa_bit_size = max_ipa_size;
>> +    }
>> +
>> +    ret = hv_vm_create(config);
>> +    os_release(config);
>> +#else
>> +    ret = hv_vm_create(NULL);
>> +#endif
>> +
>> +    return ret;
>> +}
> 
>> +uint8_t round_down_to_parange_index(uint8_t bit_size)
>> +{
>> +    for (int i = ARRAY_SIZE(pamax_map) - 1; i >= 0; i--) {
>> +        if (pamax_map[i] <= bit_size) {
>> +            return i;
>> +        }
>> +    }
>> +    g_assert_not_reached();
>> +}
>> +
>> +uint8_t round_down_to_parange_bit_size(uint8_t bit_size)
>> +{
>> +    for (int i = ARRAY_SIZE(pamax_map) - 1; i >= 0; i--) {
>> +        if (pamax_map[i] <= bit_size) {
>> +            return pamax_map[i];
>> +        }
>> +    }
>> +    g_assert_not_reached();
> 
> We could implement this as
>       return pamax_map[round_down_to_parange_index(bit_size)];
> 
> and avoid having to code the loop twice, right?

Yes, my copy and paste seems dumb reading it back now :)

> 
> thanks
> -- PMM
Peter Maydell Aug. 13, 2024, 9:31 a.m. UTC | #6
On Mon, 12 Aug 2024 at 23:18, Danny Canter <danny_canter@apple.com> wrote:
> On Aug 12, 2024, at 10:52 AM, Peter Maydell <peter.maydell@linaro.org> wrote:
> > This is unfortunately probably going to imply a bit of extra
> > plumbing to be implemented for hvf -- that MachineClass::kvm_type
> > method is (as the name suggests) KVM specific. (Multi-patch
> > patchset for that, where we add the plumbing in as its own
> > separate patch (and/or whatever other split of functionality
> > into coherent chunks makes sense), rather than one-big-patch, please.)
>
> That’s perfectly fine, I’ll try and see how the plumbing was done
> for KVM and try and emulate where it makes sense
> for HVF. Agree though, that’d definitely push this into multi-patch
> territory. Curious if you think what’s here today should
> be multiple patches or the current work seems fine in one?

I think it was fine as one patch. My personal preference
when I write code tends to go for more-smaller-patches
over fewer-larger-patches, so I might have for example
split out "Add hvf_arch_vm_create()" into its own
patch, but that's very borderline, and I wouldn't ask for
that change at code review time unless the patch as a whole
was too big and unwieldy and I was looking for places to
suggest a split into multiple patches.

-- PMM
Danny Canter Aug. 17, 2024, 12:36 a.m. UTC | #7
Peter, I thought I’d send this little snippet before getting the rest of v2 done, in case anyone hates it :). I tried to take a similar approach to kvm_type,
but I’m not sure if this will be looked upon favorably, so I want an early opinion. The nice thing about kvm_type is that it at least has a differing meaning
per platform, so all the impls can do whatever they need; the below is really only needed on ARM (and is obviously macOS specific), so it’s a bit odd,
but I couldn’t think of how else to get what we need out of the memmap during VM creation.

How this would be used is almost exactly like how ARM’s kvm_type is used. We set up hvf_get_physical_address_range to freeze the memory
map and compute the highest GPA, then check whether that exceeds our platform’s largest IPA size and, if so, return a sane error message. If everything
checks out, we’d just set the IPA size on the VM config object and then create the VM. The current patch should mostly stay the same after that bit
of plumbing, I think, besides removing the macOS 13 ifdefs (and simplifying the copy-and-pasted loop you pointed out). x86’s
hvf_get_physical_address_range can be NULL.

--- a/include/hw/boards.h
+++ b/include/hw/boards.h
@@ -215,6 +215,10 @@ typedef struct {
  *    Return the type of KVM corresponding to the kvm-type string option or
  *    computed based on other criteria such as the host kernel capabilities.
  *    kvm-type may be NULL if it is not needed.
+ * @hvf_get_physical_address_range:
+ *    Returns the physical address range in bits to use for the HVF virtual
+ *    machine based on the current boards memory map. This may be NULL if it
+ *    is not needed.
  * @numa_mem_supported:
  *    true if '--numa node.mem' option is supported and false otherwise
  * @hotplug_allowed:
@@ -253,6 +257,7 @@ struct MachineClass {
     void (*reset)(MachineState *state, ShutdownCause reason);
     void (*wakeup)(MachineState *state);
     int (*kvm_type)(MachineState *machine, const char *arg);
+    unsigned int (*hvf_get_physical_address_range)(MachineState *machine);
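A board-side implementation of such a hook could compute the required width from the highest guest physical address in its frozen memory map. The sketch below is purely illustrative (the struct and function names are hypothetical stand-ins, not QEMU's actual types):

```c
#include <assert.h>
#include <stdint.h>

/* A minimal stand-in for a board memory-map entry (hypothetical type). */
typedef struct {
    uint64_t base;
    uint64_t size;
} MemMapEntry;

/* Smallest address width, in bits, that covers [0, highest_gpa]. */
static unsigned bits_for_gpa(uint64_t highest_gpa)
{
    unsigned bits = 0;

    while (highest_gpa >> bits) {
        bits++;
    }
    return bits;
}

/*
 * Sketch of what a board's hvf_get_physical_address_range hook might
 * compute: the address width needed to reach the end of the last
 * (highest) region in its memory map. The caller would compare this
 * against the platform's maximum IPA size and error out if it doesn't
 * fit, before setting the IPA size on the VM config object.
 */
static unsigned hvf_physical_address_bits(const MemMapEntry *map, int n)
{
    uint64_t highest_gpa = map[n - 1].base + map[n - 1].size - 1;

    return bits_for_gpa(highest_gpa);
}
```

For example, a map whose RAM region starts at 1GB and spans 63GB ends exactly at the 64GB boundary, so the hook would report 36 bits.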

> On Aug 13, 2024, at 2:31 AM, Peter Maydell <peter.maydell@linaro.org> wrote:
> 
> On Mon, 12 Aug 2024 at 23:18, Danny Canter <danny_canter@apple.com> wrote:
>> On Aug 12, 2024, at 10:52 AM, Peter Maydell <peter.maydell@linaro.org> wrote:
>>> This is unfortunately probably going to imply a bit of extra
>>> plumbing to be implemented for hvf -- that MachineClass::kvm_type
>>> method is (as the name suggests) KVM specific. (Multi-patch
>>> patchset for that, where we add the plumbing in as its own
>>> separate patch (and/or whatever other split of functionality
>>> into coherent chunks makes sense), rather than one-big-patch, please.)
>> 
>> That’s perfectly fine, I’ll try and see how the plumbing was done
>> for KVM and try and emulate where it makes sense
>> for HVF. Agree though, that’d definitely push this into multi-patch
>> territory. Curious if you think what’s here today should
>> be multiple patches or the current work seems fine in one?
> 
> I think it was fine as one patch. My personal preference
> when I write code tends to go for more-smaller-patches
> over fewer-larger-patches, so I might have for example
> split out "Add hvf_arch_vm_create()" into its own
> patch, but that's very borderline, and I wouldn't ask for
> that change at code review time unless the patch as a whole
> was too big and unwieldy and I was looking for places to
> suggest a split into multiple patches.
> 
> -- PMM
Peter Maydell Aug. 20, 2024, 1:42 p.m. UTC | #8
On Sat, 17 Aug 2024 at 01:37, Danny Canter <danny_canter@apple.com> wrote:
>
> Peter, thought I’d send this little snippet before getting the rest of V2 done in case anyone hates this :). I tried to take a similar approach to kvm_type,
> but I’m not sure if this will be looked upon favorably so want an early opinion. The nice thing about kvm_type is at least it has differing meaning per
> platform so all the impls can do whatever they need, with the below it’s really only needed on ARM (and obviously macOS specific) so it's a bit odd,
> but couldn’t think of how else to be able to be able to get what we need out of the memmap during vm creation.
>
> How this would be used is almost exactly like how ARMs kvm_type is used. We set up hvf_get_physical_address_range to freeze the memory
> map and compute the highest gpa, then check if that exceeds our platforms largest IPA size and if so return a sane error message. If everything
> checks out we’d just set the IPA size on the VM config object and then create the VM. The current patch should mostly stay the same after that bit
> of plumbing I think besides removing the macOS 13 ifdef’s (and simplifying the copy and pasted loop you pointed out). x86’s
> hvf_get_physical_address_range can be NULL.
>
> --- a/include/hw/boards.h
> +++ b/include/hw/boards.h
> @@ -215,6 +215,10 @@ typedef struct {
>   *    Return the type of KVM corresponding to the kvm-type string option or
>   *    computed based on other criteria such as the host kernel capabilities.
>   *    kvm-type may be NULL if it is not needed.
> + * @hvf_get_physical_address_range:
> + *    Returns the physical address range in bits to use for the HVF virtual
> + *    machine based on the current boards memory map. This may be NULL if it
> + *    is not needed.
>   * @numa_mem_supported:
>   *    true if '--numa node.mem' option is supported and false otherwise
>   * @hotplug_allowed:
> @@ -253,6 +257,7 @@ struct MachineClass {
>      void (*reset)(MachineState *state, ShutdownCause reason);
>      void (*wakeup)(MachineState *state);
>      int (*kvm_type)(MachineState *machine, const char *arg);
> +    unsigned int (*hvf_get_physical_address_range)(MachineState *machine);

My gut feeling was that this felt very specific compared
to the kvm_type method which lets different architectures
use it for whatever they need. But on the other hand we
have exactly one use for this for hvf right now and at
least for the foreseeable future it's unlikely we're going
to want to do more. And this API isn't a set-in-stone one,
so we can come back and generalize it later if we ever need
to do that. (Or, if we find we need this kind of hook for
a third hypervisor type, maybe we try to make it hypervisor
agnostic at that point.)

So I'm OK with this.

thanks
-- PMM
Danny Canter Sept. 5, 2024, 6:45 p.m. UTC | #9
Forgot to leave a trail here :). V2 is posted https://lists.gnu.org/archive/html/qemu-devel/2024-08/msg04016.html

Thanks again for review Peter!

> On Aug 13, 2024, at 2:31 AM, Peter Maydell <peter.maydell@linaro.org> wrote:
> 
> On Mon, 12 Aug 2024 at 23:18, Danny Canter <danny_canter@apple.com> wrote:
>> On Aug 12, 2024, at 10:52 AM, Peter Maydell <peter.maydell@linaro.org> wrote:
>>> This is unfortunately probably going to imply a bit of extra
>>> plumbing to be implemented for hvf -- that MachineClass::kvm_type
>>> method is (as the name suggests) KVM specific. (Multi-patch
>>> patchset for that, where we add the plumbing in as its own
>>> separate patch (and/or whatever other split of functionality
>>> into coherent chunks makes sense), rather than one-big-patch, please.)
>> 
>> That’s perfectly fine, I’ll see how the plumbing was done
>> for KVM and emulate it where it makes sense
>> for HVF. Agree though, that’d definitely push this into multi-patch
>> territory. Curious whether you think what’s here today should
>> be multiple patches, or if the current work is fine as one?
> 
> I think it was fine as one patch. My personal preference
> when I write code tends to go for more-smaller-patches
> over fewer-larger-patches, so I might have for example
> split out "Add hvf_arch_vm_create()" into its own
> patch, but that's very borderline, and I wouldn't ask for
> that change at code review time unless the patch as a whole
> was too big and unwieldy and I was looking for places to
> suggest a split into multiple patches.
> 
> -- PMM
diff mbox series

Patch

diff --git a/accel/hvf/hvf-accel-ops.c b/accel/hvf/hvf-accel-ops.c
index ac08cfb9f3..aa9ea4489b 100644
--- a/accel/hvf/hvf-accel-ops.c
+++ b/accel/hvf/hvf-accel-ops.c
@@ -61,10 +61,6 @@ 
 
 HVFState *hvf_state;
 
-#ifdef __aarch64__
-#define HV_VM_DEFAULT NULL
-#endif
-
 /* Memory slots */
 
 hvf_slot *hvf_find_overlap_slot(uint64_t start, uint64_t size)
@@ -324,7 +320,7 @@  static int hvf_accel_init(MachineState *ms)
     hv_return_t ret;
     HVFState *s;
 
-    ret = hv_vm_create(HV_VM_DEFAULT);
+    ret = hvf_arch_vm_create(ms);
     assert_hvf_ok(ret);
 
     s = g_new0(HVFState, 1);
diff --git a/include/sysemu/hvf_int.h b/include/sysemu/hvf_int.h
index 5b28d17ba1..4a4bb4c768 100644
--- a/include/sysemu/hvf_int.h
+++ b/include/sysemu/hvf_int.h
@@ -65,6 +65,7 @@  void assert_hvf_ok_impl(hv_return_t ret, const char *file, unsigned int line,
 #define assert_hvf_ok(EX) assert_hvf_ok_impl((EX), __FILE__, __LINE__, #EX)
 const char *hvf_return_string(hv_return_t ret);
 int hvf_arch_init(void);
+hv_return_t hvf_arch_vm_create(MachineState *ms);
 int hvf_arch_init_vcpu(CPUState *cpu);
 void hvf_arch_vcpu_destroy(CPUState *cpu);
 int hvf_vcpu_exec(CPUState *);
diff --git a/target/arm/hvf/hvf.c b/target/arm/hvf/hvf.c
index ef9bc42738..1d7f4750d4 100644
--- a/target/arm/hvf/hvf.c
+++ b/target/arm/hvf/hvf.c
@@ -10,6 +10,7 @@ 
  */
 
 #include "qemu/osdep.h"
+#include "qemu/units.h"
 #include "qemu/error-report.h"
 
 #include "sysemu/runstate.h"
@@ -23,6 +24,7 @@ 
 
 #include "exec/address-spaces.h"
 #include "hw/irq.h"
+#include "hw/boards.h"
 #include "qemu/main-loop.h"
 #include "sysemu/cpus.h"
 #include "arm-powerctl.h"
@@ -30,6 +32,7 @@ 
 #include "target/arm/internals.h"
 #include "target/arm/multiprocessing.h"
 #include "target/arm/gtimer.h"
+#include "target/arm/internals.h"
 #include "trace/trace-target_arm_hvf.h"
 #include "migration/vmstate.h"
 
@@ -295,6 +298,8 @@  void hvf_arm_init_debug(void)
 #define TMR_CTL_IMASK   (1 << 1)
 #define TMR_CTL_ISTATUS (1 << 2)
 
+#define FIRST_HIGHMEM_PARANGE 40
+
 static void hvf_wfi(CPUState *cpu);
 
 typedef struct HVFVTimer {
@@ -319,6 +324,8 @@  struct hvf_reg_match {
     uint64_t offset;
 };
 
+static uint32_t chosen_ipa_bit_size;
+
 static const struct hvf_reg_match hvf_reg_match[] = {
     { HV_REG_X0,   offsetof(CPUARMState, xregs[0]) },
     { HV_REG_X1,   offsetof(CPUARMState, xregs[1]) },
@@ -839,6 +846,45 @@  static uint64_t hvf_get_reg(CPUState *cpu, int rt)
     return val;
 }
 
+static uint32_t hvf_get_default_ipa_bit_size(void)
+{
+    uint32_t default_ipa_size = 36;
+#if defined(MAC_OS_VERSION_13_0) && \
+    MAC_OS_X_VERSION_MIN_REQUIRED >= MAC_OS_VERSION_13_0
+    hv_return_t ret = hv_vm_config_get_default_ipa_size(&default_ipa_size);
+    assert_hvf_ok(ret);
+#endif
+    return default_ipa_size;
+}
+
+static uint32_t hvf_get_max_ipa_bit_size(void)
+{
+    uint32_t max_ipa_size = hvf_get_default_ipa_bit_size();
+#if defined(MAC_OS_VERSION_13_0) && \
+    MAC_OS_X_VERSION_MIN_REQUIRED >= MAC_OS_VERSION_13_0
+    hv_return_t ret = hv_vm_config_get_max_ipa_size(&max_ipa_size);
+    assert_hvf_ok(ret);
+
+    /*
+     * We clamp any IPA size we want to back the VM with to a valid PARange
+     * value so the guest doesn't try to map memory outside of the valid range.
+     * This logic just clamps the passed in IPA bit size to the first valid
+     * PARange value <= to it.
+     */
+    max_ipa_size = round_down_to_parange_bit_size(max_ipa_size);
+#endif
+    return max_ipa_size;
+}
+
+static void clamp_id_aa64mmfr0_parange_to_ipa_size(uint64_t *id_aa64mmfr0)
+{
+    uint32_t ipa_size = chosen_ipa_bit_size ?
+            chosen_ipa_bit_size : hvf_get_max_ipa_bit_size();
+    /* Clamp down id_aa64mmfr0's PARange to the IPA size the kernel supports. */
+    uint8_t index = round_down_to_parange_index(ipa_size);
+    *id_aa64mmfr0 = (*id_aa64mmfr0 & ~R_ID_AA64MMFR0_PARANGE_MASK) | index;
+}
+
 static bool hvf_arm_get_host_cpu_features(ARMHostCPUFeatures *ahcf)
 {
     ARMISARegisters host_isar = {};
@@ -882,6 +928,8 @@  static bool hvf_arm_get_host_cpu_features(ARMHostCPUFeatures *ahcf)
     r |= hv_vcpu_get_sys_reg(fd, HV_SYS_REG_MIDR_EL1, &ahcf->midr);
     r |= hv_vcpu_destroy(fd);
 
+    clamp_id_aa64mmfr0_parange_to_ipa_size(&host_isar.id_aa64mmfr0);
+
     ahcf->isar = host_isar;
 
     /*
@@ -929,6 +977,66 @@  void hvf_arch_vcpu_destroy(CPUState *cpu)
 {
 }
 
+hv_return_t hvf_arch_vm_create(MachineState *ms)
+{
+    uint32_t default_ipa_size = hvf_get_default_ipa_bit_size();
+    uint32_t max_ipa_size = hvf_get_max_ipa_bit_size();
+    hv_return_t ret;
+
+    chosen_ipa_bit_size = default_ipa_size;
+
+    /*
+     * Set the IPA size for the VM:
+     *
+     * Starting with macOS 13, a new set of APIs was introduced that allows you
+     * to query for the maximum IPA size supported on your system. macOS 13's
+     * and 14's kernels both return a value less than 40 bits (typically 39,
+     * but depends on hardware), however starting in macOS 15 and up the IPA
+     * size supported (in the kernel at least) is up to 40 bits. A common scheme
+     * for attempting to get the IPA size prior to the introduction of these new
+     * APIs was to read ID_AA64MMFR0.PARange from a vcpu in the hopes that HVF
+     * was returning the maximum IPA size in that. However, this was not the
+     * case. HVF would return the host's PARange value directly which is
+     * generally larger than 40 bits.
+     *
+     * Using that value we could set up our memory map with regions much outside
+     * the actually supported IPA size, and also advertise a much larger
+     * physical address space to the guest. On the hardware+OS combos where
+     * the IPA size is less than 40 bits, but greater than 36, we also don't
+     * have a valid PARange value to round down to before 36 bits which is
+     * already the default.
+     *
+     * With that in mind, before we make the VM let's grab the maximum supported
+     * IPA size and clamp it down to the first valid PARange value so we can
+     * advertise the correct address size for the guest later on. Then if it's
+     * >= 40 set this as the IPA size for the VM using the new APIs. There's a
+     * small heuristic for actually altering the IPA size for the VM which is
+     * if our requested RAM is encroaching on the top of our default IPA size.
+     * This is just an optimization, as at 40 bits we need to create one more
+     * level of stage2 page tables.
+     */
+#if defined(MAC_OS_VERSION_13_0) && \
+    MAC_OS_X_VERSION_MIN_REQUIRED >= MAC_OS_VERSION_13_0
+    hv_vm_config_t config = hv_vm_config_create();
+
+    /* In our memory map RAM starts at 1GB. */
+    uint64_t threshold = (1ull << default_ipa_size) - (1 * GiB);
+    if (ms->ram_size >= threshold && max_ipa_size >= FIRST_HIGHMEM_PARANGE) {
+        ret = hv_vm_config_set_ipa_size(config, max_ipa_size);
+        assert_hvf_ok(ret);
+
+        chosen_ipa_bit_size = max_ipa_size;
+    }
+
+    ret = hv_vm_create(config);
+    os_release(config);
+#else
+    ret = hv_vm_create(NULL);
+#endif
+
+    return ret;
+}
+
 int hvf_arch_init_vcpu(CPUState *cpu)
 {
     ARMCPU *arm_cpu = ARM_CPU(cpu);
@@ -995,6 +1103,11 @@  int hvf_arch_init_vcpu(CPUState *cpu)
                               &arm_cpu->isar.id_aa64mmfr0);
     assert_hvf_ok(ret);
 
+    clamp_id_aa64mmfr0_parange_to_ipa_size(&arm_cpu->isar.id_aa64mmfr0);
+    ret = hv_vcpu_set_sys_reg(cpu->accel->fd, HV_SYS_REG_ID_AA64MMFR0_EL1,
+                              arm_cpu->isar.id_aa64mmfr0);
+    assert_hvf_ok(ret);
+
     return 0;
 }
 
diff --git a/target/arm/internals.h b/target/arm/internals.h
index da22d04121..3aa804bfe8 100644
--- a/target/arm/internals.h
+++ b/target/arm/internals.h
@@ -436,6 +436,25 @@  static inline void update_spsel(CPUARMState *env, uint32_t imm)
  */
 unsigned int arm_pamax(ARMCPU *cpu);
 
+/*
+ * round_down_to_parange_index
+ * @bit_size: uint8_t
+ *
+ * Rounds down the bit_size supplied to the first supported ARM physical
+ * address range and returns the index for this. The index is intended to
+ * be used to set ID_AA64MMFR0_EL1's PARANGE bits.
+ */
+uint8_t round_down_to_parange_index(uint8_t bit_size);
+
+/*
+ * round_down_to_parange_bit_size
+ * @bit_size: uint8_t
+ *
+ * Rounds down the bit_size supplied to the first supported ARM physical
+ * address range bit size and returns this.
+ */
+uint8_t round_down_to_parange_bit_size(uint8_t bit_size);
+
 /* Return true if extended addresses are enabled.
  * This is always the case if our translation regime is 64 bit,
  * but depends on TTBCR.EAE for 32 bit.
diff --git a/target/arm/ptw.c b/target/arm/ptw.c
index 4476b32ff5..c4abbe5c3d 100644
--- a/target/arm/ptw.c
+++ b/target/arm/ptw.c
@@ -96,6 +96,26 @@  static const uint8_t pamax_map[] = {
     [6] = 52,
 };
 
+uint8_t round_down_to_parange_index(uint8_t bit_size)
+{
+    for (int i = ARRAY_SIZE(pamax_map) - 1; i >= 0; i--) {
+        if (pamax_map[i] <= bit_size) {
+            return i;
+        }
+    }
+    g_assert_not_reached();
+}
+
+uint8_t round_down_to_parange_bit_size(uint8_t bit_size)
+{
+    for (int i = ARRAY_SIZE(pamax_map) - 1; i >= 0; i--) {
+        if (pamax_map[i] <= bit_size) {
+            return pamax_map[i];
+        }
+    }
+    g_assert_not_reached();
+}
+
 /*
  * The cpu-specific constant value of PAMax; also used by hw/arm/virt.
  * Note that machvirt_init calls this on a CPU that is inited but not realized!
diff --git a/target/i386/hvf/hvf.c b/target/i386/hvf/hvf.c
index c9c64e2978..99e1915efa 100644
--- a/target/i386/hvf/hvf.c
+++ b/target/i386/hvf/hvf.c
@@ -223,6 +223,11 @@  int hvf_arch_init(void)
     return 0;
 }
 
+hv_return_t hvf_arch_vm_create(MachineState *ms)
+{
+    return hv_vm_create(HV_VM_DEFAULT);
+}
+
 int hvf_arch_init_vcpu(CPUState *cpu)
 {
     X86CPU *x86cpu = X86_CPU(cpu);