KVM: x86: Add host physical address width capability

Message ID jpg615ul1j8.fsf@linux.bootlegged.copy
State New

Commit Message

Bandan Das July 8, 2015, 10:36 p.m. UTC
Let userspace inquire the maximum physical address width
of the host processors; this can be used to identify maximum
memory that can be assigned to the guest.

Reported-by: Laszlo Ersek <lersek@redhat.com>
Signed-off-by: Bandan Das <bsd@redhat.com>
---
 arch/x86/kvm/x86.c       | 3 +++
 include/uapi/linux/kvm.h | 1 +
 2 files changed, 4 insertions(+)
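For illustration, a VMM could consume the proposed capability through the existing KVM_CHECK_EXTENSION ioctl on the KVM device node. This is a minimal sketch under stated assumptions: KVM_CAP_PHY_ADDR_WIDTH and its value 119 come only from this patch, which is not in mainline headers, so on a kernel without the patch the same number may be unassigned or may identify a different capability. Treat any result as meaningful only on a patched kernel. The helper names are hypothetical.

```c
#include <assert.h>
#include <fcntl.h>
#include <linux/kvm.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Capability number proposed by this patch (assumption: not in mainline
 * headers). Only meaningful on a kernel that carries the patch. */
#ifndef KVM_CAP_PHY_ADDR_WIDTH
#define KVM_CAP_PHY_ADDR_WIDTH 119
#endif

/* Ask KVM for the host physical address width. Returns the width in bits,
 * 0 if the kernel does not recognise capability 119, or -1 if the ioctl
 * itself fails (for example on a bad file descriptor). */
static int query_phys_addr_width(int kvm_fd)
{
	return ioctl(kvm_fd, KVM_CHECK_EXTENSION,
		     (unsigned long)KVM_CAP_PHY_ADDR_WIDTH);
}

/* Open the KVM device node (normally "/dev/kvm"), query it, and close it.
 * Returns -1 if the node cannot be opened. */
static int host_phys_addr_width(const char *kvm_path)
{
	assert(kvm_path != NULL);
	int fd = open(kvm_path, O_RDWR);
	if (fd < 0)
		return -1;
	int bits = query_phys_addr_width(fd);
	close(fd);
	return bits;
}
```

Usage: host_phys_addr_width("/dev/kvm") on a patched host would return boot_cpu_data.x86_phys_bits, for example 36 on the laptop discussed later in the thread.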

Comments

Paolo Bonzini July 9, 2015, 6:09 a.m. UTC | #1
On 09/07/2015 00:36, Bandan Das wrote:
> Let userspace inquire the maximum physical address width
> of the host processors; this can be used to identify maximum
> memory that can be assigned to the guest.
> 
> Reported-by: Laszlo Ersek <lersek@redhat.com>
> Signed-off-by: Bandan Das <bsd@redhat.com>
> ---
>  arch/x86/kvm/x86.c       | 3 +++
>  include/uapi/linux/kvm.h | 1 +
>  2 files changed, 4 insertions(+)
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index bbaf44e..97d6746 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -2683,6 +2683,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
>  	case KVM_CAP_NR_MEMSLOTS:
>  		r = KVM_USER_MEM_SLOTS;
>  		break;
> +	case KVM_CAP_PHY_ADDR_WIDTH:
> +		r = boot_cpu_data.x86_phys_bits;
> +		break;

Userspace can just use CPUID, can't it?

Paolo

>  	case KVM_CAP_PV_MMU:	/* obsolete */
>  		r = 0;
>  		break;
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 716ad4a..e7949a1 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -817,6 +817,7 @@ struct kvm_ppc_smmu_info {
>  #define KVM_CAP_DISABLE_QUIRKS 116
>  #define KVM_CAP_X86_SMM 117
>  #define KVM_CAP_MULTI_ADDRESS_SPACE 118
> +#define KVM_CAP_PHY_ADDR_WIDTH 119
>  
>  #ifdef KVM_CAP_IRQ_ROUTING
>  
> 
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Laszlo Ersek July 9, 2015, 6:43 a.m. UTC | #2
On 07/09/15 08:09, Paolo Bonzini wrote:
> 
> 
> On 09/07/2015 00:36, Bandan Das wrote:
>> Let userspace inquire the maximum physical address width
>> of the host processors; this can be used to identify maximum
>> memory that can be assigned to the guest.
>>
>> Reported-by: Laszlo Ersek <lersek@redhat.com>
>> Signed-off-by: Bandan Das <bsd@redhat.com>
>> ---
>>  arch/x86/kvm/x86.c       | 3 +++
>>  include/uapi/linux/kvm.h | 1 +
>>  2 files changed, 4 insertions(+)
>>
>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>> index bbaf44e..97d6746 100644
>> --- a/arch/x86/kvm/x86.c
>> +++ b/arch/x86/kvm/x86.c
>> @@ -2683,6 +2683,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
>>  	case KVM_CAP_NR_MEMSLOTS:
>>  		r = KVM_USER_MEM_SLOTS;
>>  		break;
>> +	case KVM_CAP_PHY_ADDR_WIDTH:
>> +		r = boot_cpu_data.x86_phys_bits;
>> +		break;
> 
> Userspace can just use CPUID, can't it?

I believe KVM's cooperation is necessary, for the following reason:

The truncation only occurs when the guest-phys <-> host-phys translation
is done in hardware, *and* the phys bits of the host processor are
insufficient to represent the highest guest-phys address that the guest
will ever face.

The first condition (of course) means that the truncation depends on EPT
being enabled. (I didn't test on AMD so I don't know if RVI has the same
issue.) If EPT is disabled, either because the host processor lacks it,
or because the respective kvm_intel module parameter is set so, then the
issue cannot be experienced.

Therefore I believe a KVM patch is necessary.

However, this specific patch doesn't seem sufficient; it should also
consider whether EPT is enabled. (And the ioctl should be perhaps
renamed to reflect that -- what QEMU needs to know is not the raw
physical address width of the host processor, but whether that width
will cause EPT to silently truncate high guest-phys addresses.)
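The two conditions above can be written down as a single predicate. This is an illustrative model of the reported behaviour, not KVM code: the function name is hypothetical, it covers EPT only (AMD RVI was not tested in the report), and max_gpa stands for the highest guest-physical address the guest will ever use.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Model of the reported failure: with EPT, guest-physical addresses at or
 * above 2^host_phys_bits cannot be represented by the host processor. */
static bool ept_may_truncate(bool ept_enabled, unsigned host_phys_bits,
			     uint64_t max_gpa)
{
	assert(host_phys_bits > 0 && host_phys_bits < 64);
	if (!ept_enabled)
		return false;	/* per the report, not seen without EPT */
	return max_gpa >= (UINT64_C(1) << host_phys_bits);
}
```

This is the distinction the raw CPUID value cannot express: the same 36-bit host is fine with ept=0 and broken with ept=1 for the same guest.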

Thanks
Laszlo

> 
> Paolo
> 
>>  	case KVM_CAP_PV_MMU:	/* obsolete */
>>  		r = 0;
>>  		break;
>> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
>> index 716ad4a..e7949a1 100644
>> --- a/include/uapi/linux/kvm.h
>> +++ b/include/uapi/linux/kvm.h
>> @@ -817,6 +817,7 @@ struct kvm_ppc_smmu_info {
>>  #define KVM_CAP_DISABLE_QUIRKS 116
>>  #define KVM_CAP_X86_SMM 117
>>  #define KVM_CAP_MULTI_ADDRESS_SPACE 118
>> +#define KVM_CAP_PHY_ADDR_WIDTH 119
>>  
>>  #ifdef KVM_CAP_IRQ_ROUTING
>>  
>>

Paolo Bonzini July 9, 2015, 12:41 p.m. UTC | #3
On 09/07/2015 08:43, Laszlo Ersek wrote:
> On 07/09/15 08:09, Paolo Bonzini wrote:
>>
>>
>> On 09/07/2015 00:36, Bandan Das wrote:
>>> Let userspace inquire the maximum physical address width
>>> of the host processors; this can be used to identify maximum
>>> memory that can be assigned to the guest.
>>>
>>> Reported-by: Laszlo Ersek <lersek@redhat.com>
>>> Signed-off-by: Bandan Das <bsd@redhat.com>
>>> ---
>>>  arch/x86/kvm/x86.c       | 3 +++
>>>  include/uapi/linux/kvm.h | 1 +
>>>  2 files changed, 4 insertions(+)
>>>
>>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>>> index bbaf44e..97d6746 100644
>>> --- a/arch/x86/kvm/x86.c
>>> +++ b/arch/x86/kvm/x86.c
>>> @@ -2683,6 +2683,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
>>>  	case KVM_CAP_NR_MEMSLOTS:
>>>  		r = KVM_USER_MEM_SLOTS;
>>>  		break;
>>> +	case KVM_CAP_PHY_ADDR_WIDTH:
>>> +		r = boot_cpu_data.x86_phys_bits;
>>> +		break;
>>
>> Userspace can just use CPUID, can't it?
> 
> I believe KVM's cooperation is necessary, for the following reason:
> 
> The truncation only occurs when the guest-phys <-> host-phys translation
> is done in hardware, *and* the phys bits of the host processor are
> insufficient to represent the highest guest-phys address that the guest
> will ever face.
> 
> The first condition (of course) means that the truncation depends on EPT
> being enabled. (I didn't test on AMD so I don't know if RVI has the same
> issue.) If EPT is disabled, either because the host processor lacks it,
> or because the respective kvm_intel module parameter is set so, then the
> issue cannot be experienced.
> 
> Therefore I believe a KVM patch is necessary.
> 
> However, this specific patch doesn't seem sufficient; it should also
> consider whether EPT is enabled. (And the ioctl should be perhaps
> renamed to reflect that -- what QEMU needs to know is not the raw
> physical address width of the host processor, but whether that width
> will cause EPT to silently truncate high guest-phys addresses.)

Right; if you want to consider whether EPT is enabled (which is the
right thing to do, albeit it makes for a much bigger patch) a KVM patch
is necessary.  In that case you also need to patch the API documentation.

Paolo
Bandan Das July 9, 2015, 6:32 p.m. UTC | #4
Paolo Bonzini <pbonzini@redhat.com> writes:

> On 09/07/2015 08:43, Laszlo Ersek wrote:
>> On 07/09/15 08:09, Paolo Bonzini wrote:
>>>
>>>
>>> On 09/07/2015 00:36, Bandan Das wrote:
>>>> Let userspace inquire the maximum physical address width
>>>> of the host processors; this can be used to identify maximum
>>>> memory that can be assigned to the guest.
>>>>
>>>> Reported-by: Laszlo Ersek <lersek@redhat.com>
>>>> Signed-off-by: Bandan Das <bsd@redhat.com>
>>>> ---
>>>>  arch/x86/kvm/x86.c       | 3 +++
>>>>  include/uapi/linux/kvm.h | 1 +
>>>>  2 files changed, 4 insertions(+)
>>>>
>>>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>>>> index bbaf44e..97d6746 100644
>>>> --- a/arch/x86/kvm/x86.c
>>>> +++ b/arch/x86/kvm/x86.c
>>>> @@ -2683,6 +2683,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
>>>>  	case KVM_CAP_NR_MEMSLOTS:
>>>>  		r = KVM_USER_MEM_SLOTS;
>>>>  		break;
>>>> +	case KVM_CAP_PHY_ADDR_WIDTH:
>>>> +		r = boot_cpu_data.x86_phys_bits;
>>>> +		break;
>>>
>>> Userspace can just use CPUID, can't it?
>> 
>> I believe KVM's cooperation is necessary, for the following reason:
>> 
>> The truncation only occurs when the guest-phys <-> host-phys translation
>> is done in hardware, *and* the phys bits of the host processor are
>> insufficient to represent the highest guest-phys address that the guest
>> will ever face.
>> 
>> The first condition (of course) means that the truncation depends on EPT
>> being enabled. (I didn't test on AMD so I don't know if RVI has the same
>> issue.) If EPT is disabled, either because the host processor lacks it,
>> or because the respective kvm_intel module parameter is set so, then the
>> issue cannot be experienced.
>> 
>> Therefore I believe a KVM patch is necessary.
>> 
>> However, this specific patch doesn't seem sufficient; it should also
>> consider whether EPT is enabled. (And the ioctl should be perhaps
>> renamed to reflect that -- what QEMU needs to know is not the raw
>> physical address width of the host processor, but whether that width
>> will cause EPT to silently truncate high guest-phys addresses.)
>
> Right; if you want to consider whether EPT is enabled (which is the
> right thing to do, albeit it makes for a much bigger patch) a KVM patch
> is necessary.  In that case you also need to patch the API documentation.

Note that this patch really doesn't do anything except for printing a
message that something might potentially go wrong. Without EPT, you don't
hit the processor limitation with your setup, but the user should nevertheless
still be notified. In fact, I think shadow paging code should also emulate
this behavior if the gpa is out of range.

> Paolo
Laszlo Ersek July 9, 2015, 6:57 p.m. UTC | #5
On 07/09/15 20:32, Bandan Das wrote:
> Paolo Bonzini <pbonzini@redhat.com> writes:
> 
>> On 09/07/2015 08:43, Laszlo Ersek wrote:
>>> On 07/09/15 08:09, Paolo Bonzini wrote:
>>>>
>>>>
>>>> On 09/07/2015 00:36, Bandan Das wrote:
>>>>> Let userspace inquire the maximum physical address width
>>>>> of the host processors; this can be used to identify maximum
>>>>> memory that can be assigned to the guest.
>>>>>
>>>>> Reported-by: Laszlo Ersek <lersek@redhat.com>
>>>>> Signed-off-by: Bandan Das <bsd@redhat.com>
>>>>> ---
>>>>>  arch/x86/kvm/x86.c       | 3 +++
>>>>>  include/uapi/linux/kvm.h | 1 +
>>>>>  2 files changed, 4 insertions(+)
>>>>>
>>>>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>>>>> index bbaf44e..97d6746 100644
>>>>> --- a/arch/x86/kvm/x86.c
>>>>> +++ b/arch/x86/kvm/x86.c
>>>>> @@ -2683,6 +2683,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
>>>>>  	case KVM_CAP_NR_MEMSLOTS:
>>>>>  		r = KVM_USER_MEM_SLOTS;
>>>>>  		break;
>>>>> +	case KVM_CAP_PHY_ADDR_WIDTH:
>>>>> +		r = boot_cpu_data.x86_phys_bits;
>>>>> +		break;
>>>>
>>>> Userspace can just use CPUID, can't it?
>>>
>>> I believe KVM's cooperation is necessary, for the following reason:
>>>
>>> The truncation only occurs when the guest-phys <-> host-phys translation
>>> is done in hardware, *and* the phys bits of the host processor are
>>> insufficient to represent the highest guest-phys address that the guest
>>> will ever face.
>>>
>>> The first condition (of course) means that the truncation depends on EPT
>>> being enabled. (I didn't test on AMD so I don't know if RVI has the same
>>> issue.) If EPT is disabled, either because the host processor lacks it,
>>> or because the respective kvm_intel module parameter is set so, then the
>>> issue cannot be experienced.
>>>
>>> Therefore I believe a KVM patch is necessary.
>>>
>>> However, this specific patch doesn't seem sufficient; it should also
>>> consider whether EPT is enabled. (And the ioctl should be perhaps
>>> renamed to reflect that -- what QEMU needs to know is not the raw
>>> physical address width of the host processor, but whether that width
>>> will cause EPT to silently truncate high guest-phys addresses.)
>>
>> Right; if you want to consider whether EPT is enabled (which is the
>> right thing to do, albeit it makes for a much bigger patch) a KVM patch
>> is necessary.  In that case you also need to patch the API documentation.
> 
> Note that this patch really doesn't do anything except for printing a
> message that something might potentially go wrong.

Yes.

> Without EPT, you don't
> hit the processor limitation with your setup, but the user should nevertheless
> still be notified.

I disagree.

> In fact, I think shadow paging code should also emulate
> this behavior if the gpa is out of range.

I disagree.

There is no "out of range" gpa. QEMU allocates enough memory, and it
should be completely transparent to the guest. The fact that it silently
breaks with nested paging if the host processor doesn't have enough
address bits is a bug (maybe a hardware bug, maybe a KVM bug; I'm not
sure, but I suspect it's a hardware bug). In any case the guest
shouldn't care at all. It is a *virtual* machine, and the VMM should lie
to it plausibly enough. How much RAM, and how many phys address bits the
host has, is a performance question, but it should not be a correctness
question. A 256 GB guest should run (slowly, but correctly) on a laptop
that has only 4 GB of RAM and only 36 phys addr bits, but plenty of swap
space.

Because otherwise your argument could be extrapolated as "TCG should
break too if the gpa is 'out of range'".

So, I disagree. Whatever memory you give to the guest should just work
(unless of course you want to emulate a small address width for the
*VCPU*, but that's absolutely not the use case here). What we have here
is a leaky abstraction: a PCPU limitation giving away a lie that the
guest should never notice. The guest should be able to use all memory
that was specified with QEMU's -m, regardless of TCG vs. KVM-without-EPT
vs. KVM-with-EPT. If the last case cannot work (due to hardware
limitations), that's fine, but then (and only then) a warning should be
printed.
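To make the 256 GB example concrete: assuming, as a simplification, that guest RAM is contiguous from guest-physical address 0 (real PC memory maps have holes), the highest address is 256 GiB - 1, which needs 38 bits, two more than a 36-bit host provides. The helper below is a hypothetical sketch of that arithmetic.

```c
#include <assert.h>
#include <stdint.h>

/* Smallest address width that can represent every address in [0, top_gpa].
 * Returns 0 for top_gpa == 0. */
static unsigned addr_bits_needed(uint64_t top_gpa)
{
	unsigned bits = 0;
	while (bits < 64 && (top_gpa >> bits) != 0)
		bits++;
	return bits;
}
```

Under TCG or shadow paging that gap should only cost performance; with EPT it is exactly the case where a warning (or a hard failure) would be justified.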

... In any case, please understand that I'm not campaigning for this
warning :) IIRC the warning was your (very welcome!) idea after I
reported the problem; I'm just trying to ensure that the warning matches
the exact issue I encountered.

Thanks!
Laszlo
Bandan Das July 9, 2015, 8:02 p.m. UTC | #6
Laszlo Ersek <lersek@redhat.com> writes:
...
> Yes.
>
>> Without EPT, you don't
>> hit the processor limitation with your setup, but the user should nevertheless
>> still be notified.
>
> I disagree.
>
>> In fact, I think shadow paging code should also emulate
>> this behavior if the gpa is out of range.
>
> I disagree.
>
> There is no "out of range" gpa. QEMU allocates enough memory, and it
> should be completely transparent to the guest. The fact that it silently
> breaks with nested paging if the host processor doesn't have enough
> address bits is a bug (maybe a hardware bug, maybe a KVM bug; I'm not
> sure, but I suspect it's a hardware bug). In any case the guest
> shouldn't care at all. It is a *virtual* machine, and the VMM should lie
> to it plausibly enough. How much RAM, and how many phys address bits the
> host has, is a performance question, but it should not be a correctness
> question. A 256 GB guest should run (slowly, but correctly) on a laptop
> that has only 4 GB of RAM and only 36 phys addr bits, but plenty of swap
> space.
>
> Because otherwise your argument could be extrapolated as "TCG should
> break too if the gpa is 'out of range'".
>
> So, I disagree. Whatever memory you give to the guest should just work
> (unless of course you want to emulate a small address width for the
> *VCPU*, but that's absolutely not the use case here). What we have here
> is a leaky abstraction: a PCPU limitation giving away a lie that the
> guest should never notice. The guest should be able to use all memory
> that was specified with QEMU's -m, regardless of TCG vs. KVM-without-EPT
> vs. KVM-with-EPT. If the last case cannot work (due to hardware
> limitations), that's fine, but then (and only then) a warning should be
> printed.

Hmm... Ok, I understand your point. So, this is more like an EPT
limitation/bug in that Qemu isn't complaining about the memory assigned
to the guest but EPT code is breaking owing to the processor physical
address width. And honestly, I now think that this patch just makes the whole
situation more confusing :) I am wondering if it's just possible for kvm to
simply throw an error like an EPT misconfiguration or something ..

Or in other words, if using a hardware assisted mechanism is just not
possible, KVM will simply not let it run instead of letting a guest
get stuck in boot.


> ... In any case, please understand that I'm not campaigning for this
> warning :) IIRC the warning was your (very welcome!) idea after I
> reported the problem; I'm just trying to ensure that the warning match
> the exact issue I encountered.
>
> Thanks!
> Laszlo
Laszlo Ersek July 9, 2015, 8:15 p.m. UTC | #7
On 07/09/15 22:02, Bandan Das wrote:
> Laszlo Ersek <lersek@redhat.com> writes:
> ...
>> Yes.
>>
>>> Without EPT, you don't
>>> hit the processor limitation with your setup, but the user should nevertheless
>>> still be notified.
>>
>> I disagree.
>>
>>> In fact, I think shadow paging code should also emulate
>>> this behavior if the gpa is out of range.
>>
>> I disagree.
>>
>> There is no "out of range" gpa. QEMU allocates enough memory, and it
>> should be completely transparent to the guest. The fact that it silently
>> breaks with nested paging if the host processor doesn't have enough
>> address bits is a bug (maybe a hardware bug, maybe a KVM bug; I'm not
>> sure, but I suspect it's a hardware bug). In any case the guest
>> shouldn't care at all. It is a *virtual* machine, and the VMM should lie
>> to it plausibly enough. How much RAM, and how many phys address bits the
>> host has, is a performance question, but it should not be a correctness
>> question. A 256 GB guest should run (slowly, but correctly) on a laptop
>> that has only 4 GB of RAM and only 36 phys addr bits, but plenty of swap
>> space.
>>
>> Because otherwise your argument could be extrapolated as "TCG should
>> break too if the gpa is 'out of range'".
>>
>> So, I disagree. Whatever memory you give to the guest should just work
>> (unless of course you want to emulate a small address width for the
>> *VCPU*, but that's absolutely not the use case here). What we have here
>> is a leaky abstraction: a PCPU limitation giving away a lie that the
>> guest should never notice. The guest should be able to use all memory
>> that was specified with QEMU's -m, regardless of TCG vs. KVM-without-EPT
>> vs. KVM-with-EPT. If the last case cannot work (due to hardware
>> limitations), that's fine, but then (and only then) a warning should be
>> printed.
> 
> Hmm... Ok, I understand your point. So, this is more like a EPT
> limitation/bug in that Qemu isn't complaining about the memory assigned
> to the guest but EPT code is breaking owing to the processor physical
> address width.

Exactly.

> And honestly, I now think that this patch just makes the whole
> situation more confusing :) I am wondering if it's just possible for kvm to
> simply throw an error like a EPT misconfiguration or something ..
> 
> Or in other words, if using a hardware assisted mechanism is just not
> possible, KVM will simply not let it run instead of letting a guest
> stuck in boot.

That would be the best solution.

Thanks
Laszlo
Bandan Das July 10, 2015, 3:37 a.m. UTC | #8
Bandan Das <bsd@redhat.com> writes:

> Laszlo Ersek <lersek@redhat.com> writes:
> ...
>> Yes.
>>
>>> Without EPT, you don't
>>> hit the processor limitation with your setup, but the user should nevertheless
>>> still be notified.
>>
>> I disagree.
>>
>>> In fact, I think shadow paging code should also emulate
>>> this behavior if the gpa is out of range.
>>
>> I disagree.
>>
>> There is no "out of range" gpa. QEMU allocates enough memory, and it
>> should be completely transparent to the guest. The fact that it silently
>> breaks with nested paging if the host processor doesn't have enough
>> address bits is a bug (maybe a hardware bug, maybe a KVM bug; I'm not
>> sure, but I suspect it's a hardware bug). In any case the guest
>> shouldn't care at all. It is a *virtual* machine, and the VMM should lie
>> to it plausibly enough. How much RAM, and how many phys address bits the
>> host has, is a performance question, but it should not be a correctness
>> question. A 256 GB guest should run (slowly, but correctly) on a laptop
>> that has only 4 GB of RAM and only 36 phys addr bits, but plenty of swap
>> space.
>>
>> Because otherwise your argument could be extrapolated as "TCG should
>> break too if the gpa is 'out of range'".
>>
>> So, I disagree. Whatever memory you give to the guest should just work
>> (unless of course you want to emulate a small address width for the
>> *VCPU*, but that's absolutely not the use case here). What we have here
>> is a leaky abstraction: a PCPU limitation giving away a lie that the
>> guest should never notice. The guest should be able to use all memory
>> that was specified with QEMU's -m, regardless of TCG vs. KVM-without-EPT
>> vs. KVM-with-EPT. If the last case cannot work (due to hardware
>> limitations), that's fine, but then (and only then) a warning should be
>> printed.
>
> Hmm... Ok, I understand your point. So, this is more like a EPT
> limitation/bug in that Qemu isn't complaining about the memory assigned
> to the guest but EPT code is breaking owing to the processor physical
> address width. And honestly, I now think that this patch just makes the whole
> situation more confusing :) I am wondering if it's just possible for kvm to
> simply throw an error like a EPT misconfiguration or something ..
>
> Or in other words, if using a hardware assisted mechanism is just not
> possible, KVM will simply not let it run instead of letting a guest
> stuck in boot.

I noticed that when the guest gets stuck, trace shows an endless loop of
EXTERNAL_INTERRUPT exits with code 14 (PF).

There's a note in 28.2.2 of the spec that "No processors supporting
the Intel 64 architecture support more than 48 physical-address bits...
An attempt to use such an address causes a page fault". So, my first
guess was to print out the guest physical address. That seems to be
well beyond the range and is always 0xff000 (when the guest is stuck).

The other thing I can think of is the EPT entries have bits in the
51:N range set which is reserved and always 0. I haven't verified
but it looks like there's ept_misconfig_inspect_spte() that should
already catch this condition. I am out of ideas for today :)
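The "bits 51:N" condition can be expressed as a mask check, where N is the processor's physical-address width (MAXPHYADDR). This is an illustrative sketch only, not KVM's ept_misconfig_inspect_spte(); the helper names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Mask of address bits N..51 in a 64-bit paging-structure entry,
 * where N = maxphyaddr. These bits are reserved and must be zero. */
static uint64_t rsvd_addr_mask(unsigned maxphyaddr)
{
	assert(maxphyaddr >= 32 && maxphyaddr <= 52);
	uint64_t below52 = (UINT64_C(1) << 52) - 1;		/* bits 51:0 */
	uint64_t below_n = (UINT64_C(1) << maxphyaddr) - 1;	/* bits N-1:0 */
	return below52 & ~below_n;				/* bits 51:N */
}

/* True if an entry points at (or encodes) an address the CPU cannot hold. */
static bool entry_sets_rsvd_addr_bits(uint64_t entry, unsigned maxphyaddr)
{
	return (entry & rsvd_addr_mask(maxphyaddr)) != 0;
}
```

On a 36-bit host the mask is 0x000FFFF000000000, so an entry built from a guest-physical address at or above 64 GiB would trip this check if the address bits were copied into the entry unchanged.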

>
>> ... In any case, please understand that I'm not campaigning for this
>> warning :) IIRC the warning was your (very welcome!) idea after I
>> reported the problem; I'm just trying to ensure that the warning match
>> the exact issue I encountered.
>>
>> Thanks!
>> Laszlo
Paolo Bonzini July 10, 2015, 2:13 p.m. UTC | #9
On 09/07/2015 20:57, Laszlo Ersek wrote:
>> Without EPT, you don't
>> hit the processor limitation with your setup, but the user should nevertheless
>> still be notified.
> 
> I disagree.

FWIW, I also disagree (and it looks like Bandan disagrees with himself
now :)).

>> In fact, I think shadow paging code should also emulate
>> this behavior if the gpa is out of range.
> 
> I disagree.

Same here.

> There is no "out of range" gpa. QEMU allocates enough memory, and it
> should be completely transparent to the guest. The fact that it silently
> breaks with nested paging if the host processor doesn't have enough
> address bits is a bug (maybe a hardware bug, maybe a KVM bug; I'm not
> sure, but I suspect it's a hardware bug).

It's a hardware bug, possibly due to some limitations in the physical
addresses that the TLB can store?  I guess KVM could detect the
situation and fall back to sloooow shadow paging.

> ... In any case, please understand that I'm not campaigning for this
> warning :) IIRC the warning was your (very welcome!) idea after I
> reported the problem; I'm just trying to ensure that the warning match
> the exact issue I encountered.

Yup.  I think the right thing to do would be to hide memory above the
limit.  A kernel patch to query the limit is definitely necessary, but
it needs to return e.g. 48 for shadow paging (otherwise you could just
use CPUID).  I'm not sure if the rest is possible with just QEMU, or it
requires help from the firmware.  Probably yes.
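A hypothetical sketch of the value such a query could return, following the suggestion above: the host width when two-dimensional paging (EPT) is active, and an architectural limit such as 48 (the example value given above) under shadow paging. The function is illustrative only and is not an actual KVM interface.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical: guest-physical address width KVM could report to userspace. */
static unsigned reported_gpa_bits(bool tdp_enabled, unsigned host_phys_bits)
{
	assert(host_phys_bits > 0 && host_phys_bits <= 52);
	/* With EPT, the host's physical width bounds guest-physical addresses. */
	if (tdp_enabled)
		return host_phys_bits;
	/* With shadow paging, the host width does not constrain the guest. */
	return 48;
}
```

Because the answer depends on whether EPT is enabled, this is something CPUID alone cannot provide, which is why a kernel patch is still needed.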

Paolo
Laszlo Ersek July 10, 2015, 2:57 p.m. UTC | #10
On 07/10/15 16:13, Paolo Bonzini wrote:
> 
> 
> On 09/07/2015 20:57, Laszlo Ersek wrote:
>>> Without EPT, you don't
>>> hit the processor limitation with your setup, but the user should nevertheless
>>> still be notified.
>>
>> I disagree.
> 
> FWIW, I also disagree (and it looks like Bandan disagrees with himself
> now :)).
> 
>>> In fact, I think shadow paging code should also emulate
>>> this behavior if the gpa is out of range.
>>
>> I disagree.
> 
> Same here.
> 
>> There is no "out of range" gpa. QEMU allocates enough memory, and it
>> should be completely transparent to the guest. The fact that it silently
>> breaks with nested paging if the host processor doesn't have enough
>> address bits is a bug (maybe a hardware bug, maybe a KVM bug; I'm not
>> sure, but I suspect it's a hardware bug).
> 
> It's a hardware bug, possibly due to some limitations in the physical
> addresses that the TLB can store?  I guess KVM could detect the
> situation and fall back to sloooow shadow paging.
> 
>> ... In any case, please understand that I'm not campaigning for this
>> warning :) IIRC the warning was your (very welcome!) idea after I
>> reported the problem; I'm just trying to ensure that the warning match
>> the exact issue I encountered.
> 
> Yup.  I think the right thing to do would be to hide memory above the
> limit.

How so?

- The stack would not be doing what the user asks for. Pass -m <a_lot>,
and the guest would silently see less memory. If the user found out,
he'd immediately ask (or set out debugging) why. I think if the user's
request cannot be satisfied, the stack should fail hard.

- Assuming the user didn't find out, and the guest just worked (with
less memory than the user asked for), then the hidden portion of the
memory (that QEMU allocated nonetheless) would be just wasted, on the
host system. (Especially with overcommit_memory=2 (which is the most
prudent setting).)

Thanks
Laszlo

>  A kernel patch to query the limit is definitely necessary, but
> it needs to return e.g. 48 for shadow paging (otherwise you could just
> use CPUID).  I'm not sure if the rest is possible with just QEMU, or it
> requires help from the firmware.  Probably yes.
> 
> Paolo
> 

Paolo Bonzini July 10, 2015, 2:59 p.m. UTC | #11
On 10/07/2015 16:57, Laszlo Ersek wrote:
> > > ... In any case, please understand that I'm not campaigning for this
> > > warning :) IIRC the warning was your (very welcome!) idea after I
> > > reported the problem; I'm just trying to ensure that the warning match
> > > the exact issue I encountered.
> > 
> > Yup.  I think the right thing to do would be to hide memory above the
> > limit.
> How so?
> 
> - The stack would not be doing what the user asks for. Pass -m <a_lot>,
> and the guest would silently see less memory. If the user found out,
> he'd immediately ask (or set out debugging) why. I think if the user's
> request cannot be satisfied, the stack should fail hard.

That's another possibility.  I think both of them are wrong depending on
_why_ you're using "-m <a lot>" in the first place.

Considering that this really happens (on Xeons) only for 1TB+ guests,
it's probably just for debugging and then hiding the memory makes some
sense.
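The two policies being debated (hide memory above the limit, or fail hard) differ only in what happens when the request does not fit. A hypothetical sketch of a start-up check a VMM could run, assuming for simplicity that guest RAM is contiguous from address 0 and that gpa_limit is 2^bits as reported by the host:

```c
#include <assert.h>
#include <stdint.h>

enum ram_policy { RAM_POLICY_FAIL, RAM_POLICY_HIDE };

/* Returns the amount of RAM to expose to the guest, or 0 to refuse to start. */
static uint64_t usable_ram(uint64_t requested, uint64_t gpa_limit,
			   enum ram_policy policy)
{
	assert(gpa_limit > 0);
	if (requested <= gpa_limit)
		return requested;	/* fits: honour -m as given */
	return policy == RAM_POLICY_HIDE ? gpa_limit : 0;
}
```

With RAM_POLICY_HIDE the guest silently sees less than -m asked for and the rest is allocated but unused; with RAM_POLICY_FAIL the user gets an error and must choose a smaller -m or disable EPT.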

Paolo
Laszlo Ersek July 10, 2015, 3:06 p.m. UTC | #12
On 07/10/15 16:59, Paolo Bonzini wrote:
> 
> 
> On 10/07/2015 16:57, Laszlo Ersek wrote:
>>>> ... In any case, please understand that I'm not campaigning for this
>>>> warning :) IIRC the warning was your (very welcome!) idea after I
>>>> reported the problem; I'm just trying to ensure that the warning match
>>>> the exact issue I encountered.
>>>
>>> Yup.  I think the right thing to do would be to hide memory above the
>>> limit.
>> How so?
>>
>> - The stack would not be doing what the user asks for. Pass -m <a_lot>,
>> and the guest would silently see less memory. If the user found out,
>> he'd immediately ask (or set out debugging) why. I think if the user's
>> request cannot be satisfied, the stack should fail hard.
> 
> That's another possibility.  I think both of them are wrong depending on
> _why_ you're using "-m <a lot>" in the first place.
> 
> Considering that this really happens (on Xeons) only for 1TB+ guests,

I reported this issue because I ran into it with a ~64GB guest. From my
/proc/cpuinfo:

model name      : Intel(R) Core(TM) i7 CPU       M 620  @ 2.67GHz
address sizes   : 36 bits physical, 48 bits virtual

I was specifically developing 64GB+ support for OVMF, and this
limitation caused me to think that there was a bug in my OVMF patches.
(There wasn't.) An error message from QEMU, advising me to turn off EPT,
would have saved me many hours.
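Why a ~64GB guest already hits a 36-bit host: 2^36 bytes is exactly 64 GiB, and on PC machine types part of guest RAM is typically relocated above 4 GiB to leave room for the 32-bit PCI hole, so the highest guest-physical RAM address ends up above the RAM size. The sketch below models that layout; the 2 GiB below-4G split used in the example is an assumption, since the real split depends on the machine type and QEMU version.

```c
#include <assert.h>
#include <stdint.h>

#define GiB (UINT64_C(1) << 30)

/* Simplified PC memory map: up to low_ram_max bytes of RAM below 4 GiB,
 * the remainder placed starting at 4 GiB. Returns the top RAM address. */
static uint64_t top_ram_gpa(uint64_t ram, uint64_t low_ram_max)
{
	assert(ram > 0 && low_ram_max <= 4 * GiB);
	if (ram <= low_ram_max)
		return ram - 1;
	return 4 * GiB + (ram - low_ram_max) - 1;
}
```

With 64 GiB of RAM and an assumed 2 GiB below 4G, the top address is 66 GiB - 1, beyond the 64 GiB a 36-bit host can address, before any 64-bit PCI MMIO the firmware places above RAM is counted.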

Thanks
Laszlo

> it's probably just for debugging and then hiding the memory makes some
> sense.
> 
> Paolo
> 

Bandan Das July 10, 2015, 3:45 p.m. UTC | #13
Laszlo Ersek <lersek@redhat.com> writes:

> On 07/10/15 16:59, Paolo Bonzini wrote:
>> 
>> 
>> On 10/07/2015 16:57, Laszlo Ersek wrote:
>>>>> ... In any case, please understand that I'm not campaigning for this
>>>>> warning :) IIRC the warning was your (very welcome!) idea after I
>>>>> reported the problem; I'm just trying to ensure that the warning match
>>>>> the exact issue I encountered.
>>>>
>>>> Yup.  I think the right thing to do would be to hide memory above the
>>>> limit.
>>> How so?
>>>
>>> - The stack would not be doing what the user asks for. Pass -m <a_lot>,
>>> and the guest would silently see less memory. If the user found out,
>>> he'd immediately ask (or set out debugging) why. I think if the user's
>>> request cannot be satisfied, the stack should fail hard.
>> 
>> That's another possibility.  I think both of them are wrong depending on
>> _why_ you're using "-m <a lot>" in the first place.
>> 
>> Considering that this really happens (on Xeons) only for 1TB+ guests,
>
> I reported this issue because I ran into it with a ~64GB guest. From my
> /proc/cpuinfo:
>
> model name      : Intel(R) Core(TM) i7 CPU       M 620  @ 2.67GHz
> address sizes   : 36 bits physical, 48 bits virtual
>
> I was specifically developing 64GB+ support for OVMF, and this
> limitation caused me to think that there was a bug in my OVMF patches.
> (There wasn't.) An error message from QEMU, advising me to turn off EPT,
> would have saved me many hours.

Right, I specifically reserved a system with 36 bits physical to reproduce
this and it was very easy to reproduce. If it's a hardware bug, I would say,
it's a very annoying one (if not serious). I wonder if Intel folks can
chime in.

> Thanks
> Laszlo
>
>> it's probably just for debugging and then hiding the memory makes some
>> sense.
Actually, I agree with Laszlo here. Hiding memory amounts to the same thing as
failing: either way the user ends up with less than the -m argument asked for.
But failing, and letting the user lower -m himself, can save hours of debugging.

Regards,
The confused teenager who can't make up his mind.

>> Paolo
>> 

Patch

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index bbaf44e..97d6746 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2683,6 +2683,9 @@  int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 	case KVM_CAP_NR_MEMSLOTS:
 		r = KVM_USER_MEM_SLOTS;
 		break;
+	case KVM_CAP_PHY_ADDR_WIDTH:
+		r = boot_cpu_data.x86_phys_bits;
+		break;
 	case KVM_CAP_PV_MMU:	/* obsolete */
 		r = 0;
 		break;
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 716ad4a..e7949a1 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -817,6 +817,7 @@  struct kvm_ppc_smmu_info {
 #define KVM_CAP_DISABLE_QUIRKS 116
 #define KVM_CAP_X86_SMM 117
 #define KVM_CAP_MULTI_ADDRESS_SPACE 118
+#define KVM_CAP_PHY_ADDR_WIDTH 119
 
 #ifdef KVM_CAP_IRQ_ROUTING