diff mbox series

[RFC,13/21] KVM: X86: Handle private MMIO as shared

Message ID 20240823132137.336874-14-aik@amd.com (mailing list archive)
State New
Headers show
Series Secure VFIO, TDISP, SEV TIO | expand

Commit Message

Alexey Kardashevskiy Aug. 23, 2024, 1:21 p.m. UTC
Currently private MMIO nested page faults are not expected so when such
fault occurs, KVM tries moving the faulted page from private to shared
which is not going to work as private MMIO is not backed by memfd.

Handle private MMIO as shared: skip page state change and memfd
page state tracking.

The MMIO KVM memory slot is still marked as shared as the guest can
access it as private or shared so marking the MMIO slot as private
is not going to help.

Signed-off-by: Alexey Kardashevskiy <aik@amd.com>
---
 arch/x86/kvm/mmu/mmu.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

Comments

Xu Yilun Aug. 30, 2024, 4:57 p.m. UTC | #1
On Fri, Aug 23, 2024 at 11:21:27PM +1000, Alexey Kardashevskiy wrote:
> Currently private MMIO nested page faults are not expected so when such
> fault occurs, KVM tries moving the faulted page from private to shared
> which is not going to work as private MMIO is not backed by memfd.
> 
> Handle private MMIO as shared: skip page state change and memfd

This means host keeps the mapping for private MMIO, which is different
from private memory. Not sure if it is expected, and I want to get
some directions here.

From HW perspective, private MMIO is not intended to be accessed by
host, but the consequence may varies. According to TDISP spec 11.2,
my understanding is private device (known as TDI) should reject the
TLP and transition to TDISP ERROR state. But no further error
reporting or logging is mandated. So the impact to the host system
is specific to each device. In my test environment, an AER
NonFatalErr is reported and nothing more, much better than host
accessing private memory.

On SW side, my concern is how to deal with mmu_notifier. In theory, if
we get pfn from hva we should follow the userspace mapping change. But
that makes no sense. Especially for TDX TEE-IO, private MMIO mapping
in SEPT cannot be changed or invalidated as long as TDI is running.

Another concern may be specific for TDX TEE-IO. Allowing both userspace
mapping and SEPT mapping may be safe for private MMIO, but on
KVM_SET_USER_MEMORY_REGION2,  KVM cannot actually tell if a userspace
addr is really for private MMIO. I.e. user could provide shared memory
addr to KVM but declare it is for private MMIO. The shared memory then
could be mapped in SEPT and cause problem.

So personally I prefer no host mapping for private MMIO.

Thanks,
Yilun

> page state tracking.
> 
> The MMIO KVM memory slot is still marked as shared as the guest can
> access it as private or shared so marking the MMIO slot as private
> is not going to help.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@amd.com>
> ---
>  arch/x86/kvm/mmu/mmu.c | 6 +++++-
>  1 file changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 928cf84778b0..e74f5c3d0821 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -4366,7 +4366,11 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
>  {
>  	bool async;
>  
> -	if (fault->is_private)
> +	if (fault->slot && fault->is_private && !kvm_slot_can_be_private(fault->slot) &&
> +	    (vcpu->kvm->arch.vm_type == KVM_X86_SNP_VM))
> +		pr_warn("%s: private SEV TIO MMIO fault for fault->gfn=%llx\n",
> +			__func__, fault->gfn);
> +	else if (fault->is_private)
>  		return kvm_faultin_pfn_private(vcpu, fault);
>  
>  	async = false;
> -- 
> 2.45.2
> 
>
Alexey Kardashevskiy Sept. 2, 2024, 2:22 a.m. UTC | #2
On 31/8/24 02:57, Xu Yilun wrote:
> On Fri, Aug 23, 2024 at 11:21:27PM +1000, Alexey Kardashevskiy wrote:
>> Currently private MMIO nested page faults are not expected so when such
>> fault occurs, KVM tries moving the faulted page from private to shared
>> which is not going to work as private MMIO is not backed by memfd.
>>
>> Handle private MMIO as shared: skip page state change and memfd
> 
> This means host keeps the mapping for private MMIO, which is different
> from private memory. Not sure if it is expected, and I want to get
> some directions here.

There is no other translation table on AMD though, the same NPT. The 
security is enforced by the RMP table. A device says "bar#x is private" 
so the host + firmware ensure the each corresponding RMP entry is 
"assigned" + "validated" and has a correct IDE stream ID and ASID, and 
the VM's kernel maps it with the Cbit set.

>  From HW perspective, private MMIO is not intended to be accessed by
> host, but the consequence may varies. According to TDISP spec 11.2,
> my understanding is private device (known as TDI) should reject the
> TLP and transition to TDISP ERROR state. But no further error
> reporting or logging is mandated. So the impact to the host system
> is specific to each device. In my test environment, an AER
> NonFatalErr is reported and nothing more, much better than host
> accessing private memory.

afair I get an non-fatal RMP fault so the device does not even notice.

> On SW side, my concern is how to deal with mmu_notifier. In theory, if
> we get pfn from hva we should follow the userspace mapping change. But
> that makes no sense. Especially for TDX TEE-IO, private MMIO mapping
> in SEPT cannot be changed or invalidated as long as TDI is running.

> Another concern may be specific for TDX TEE-IO. Allowing both userspace
> mapping and SEPT mapping may be safe for private MMIO, but on
> KVM_SET_USER_MEMORY_REGION2,  KVM cannot actually tell if a userspace
> addr is really for private MMIO. I.e. user could provide shared memory
> addr to KVM but declare it is for private MMIO. The shared memory then
> could be mapped in SEPT and cause problem.

I am missing lots of context here. When you are starting a guest with a 
passed through device, until the TDISP machinery transitions the TDI 
into RUN, this TDI's MMIO is shared and mapped everywhere. And after 
transitioning to RUN you move mappings from EPT to SEPT?

> So personally I prefer no host mapping for private MMIO.

Nah, cannot skip this step on AMD. Thanks,


> 
> Thanks,
> Yilun
> 
>> page state tracking.
>>
>> The MMIO KVM memory slot is still marked as shared as the guest can
>> access it as private or shared so marking the MMIO slot as private
>> is not going to help.
>>
>> Signed-off-by: Alexey Kardashevskiy <aik@amd.com>
>> ---
>>   arch/x86/kvm/mmu/mmu.c | 6 +++++-
>>   1 file changed, 5 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
>> index 928cf84778b0..e74f5c3d0821 100644
>> --- a/arch/x86/kvm/mmu/mmu.c
>> +++ b/arch/x86/kvm/mmu/mmu.c
>> @@ -4366,7 +4366,11 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
>>   {
>>   	bool async;
>>   
>> -	if (fault->is_private)
>> +	if (fault->slot && fault->is_private && !kvm_slot_can_be_private(fault->slot) &&
>> +	    (vcpu->kvm->arch.vm_type == KVM_X86_SNP_VM))
>> +		pr_warn("%s: private SEV TIO MMIO fault for fault->gfn=%llx\n",
>> +			__func__, fault->gfn);
>> +	else if (fault->is_private)
>>   		return kvm_faultin_pfn_private(vcpu, fault);
>>   
>>   	async = false;
>> -- 
>> 2.45.2
>>
>>
Xu Yilun Sept. 3, 2024, 5:13 a.m. UTC | #3
On Mon, Sep 02, 2024 at 12:22:56PM +1000, Alexey Kardashevskiy wrote:
> 
> 
> On 31/8/24 02:57, Xu Yilun wrote:
> > On Fri, Aug 23, 2024 at 11:21:27PM +1000, Alexey Kardashevskiy wrote:
> > > Currently private MMIO nested page faults are not expected so when such
> > > fault occurs, KVM tries moving the faulted page from private to shared
> > > which is not going to work as private MMIO is not backed by memfd.
> > > 
> > > Handle private MMIO as shared: skip page state change and memfd
> > 
> > This means host keeps the mapping for private MMIO, which is different
> > from private memory. Not sure if it is expected, and I want to get
> > some directions here.
> 
> There is no other translation table on AMD though, the same NPT. The

Sorry for not being clear, when I say "host mapping" I mean host
userspace mapping (host CR3 mapping). By using guest_memfd, there is no
host CR3 mapping for private memory. I'm wondering if we could keep host
CR3 mapping for private MMIO.

> security is enforced by the RMP table. A device says "bar#x is private" so
> the host + firmware ensure the each corresponding RMP entry is "assigned" +
> "validated" and has a correct IDE stream ID and ASID, and the VM's kernel
> maps it with the Cbit set.
> 
> >  From HW perspective, private MMIO is not intended to be accessed by
> > host, but the consequence may varies. According to TDISP spec 11.2,
> > my understanding is private device (known as TDI) should reject the
> > TLP and transition to TDISP ERROR state. But no further error
> > reporting or logging is mandated. So the impact to the host system
> > is specific to each device. In my test environment, an AER
> > NonFatalErr is reported and nothing more, much better than host
> > accessing private memory.
> 
> afair I get an non-fatal RMP fault so the device does not even notice.
> 
> > On SW side, my concern is how to deal with mmu_notifier. In theory, if
> > we get pfn from hva we should follow the userspace mapping change. But
> > that makes no sense. Especially for TDX TEE-IO, private MMIO mapping
> > in SEPT cannot be changed or invalidated as long as TDI is running.
> 
> > Another concern may be specific for TDX TEE-IO. Allowing both userspace
> > mapping and SEPT mapping may be safe for private MMIO, but on
> > KVM_SET_USER_MEMORY_REGION2,  KVM cannot actually tell if a userspace
> > addr is really for private MMIO. I.e. user could provide shared memory
> > addr to KVM but declare it is for private MMIO. The shared memory then
> > could be mapped in SEPT and cause problem.
> 
> I am missing lots of context here. When you are starting a guest with a
> passed through device, until the TDISP machinery transitions the TDI into
> RUN, this TDI's MMIO is shared and mapped everywhere. And after

Yes, that's the situation nowadays. I think if we need to eliminate
host CR3 mapping for private MMIO, a simple way is we don't allow host
CR3 mapping at the first place, even for shared pass through. It is
doable cause:

 1. IIUC, host CR3 mapping for assigned MMIO is only used for pfn
    finding, i.e. host doesn't really (or shouldn't?) access them.
 2. The hint from guest_memfd shows KVM doesn't have to rely on host
    CR3 mapping to find pfn.

> transitioning to RUN you move mappings from EPT to SEPT?

Mostly correct, TDX move mapping from EPT to SEPT after LOCKED and
right before RUN.

> 
> > So personally I prefer no host mapping for private MMIO.
> 
> Nah, cannot skip this step on AMD. Thanks,

Not sure if we are on the same page. I assume from HW perspective, host
CR3 mapping is not necessary for NPT/RMP build?

Thanks,
Yilun

> 
> 
> > 
> > Thanks,
> > Yilun
> > 
> > > page state tracking.
> > > 
> > > The MMIO KVM memory slot is still marked as shared as the guest can
> > > access it as private or shared so marking the MMIO slot as private
> > > is not going to help.
> > > 
> > > Signed-off-by: Alexey Kardashevskiy <aik@amd.com>
> > > ---
> > >   arch/x86/kvm/mmu/mmu.c | 6 +++++-
> > >   1 file changed, 5 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > > index 928cf84778b0..e74f5c3d0821 100644
> > > --- a/arch/x86/kvm/mmu/mmu.c
> > > +++ b/arch/x86/kvm/mmu/mmu.c
> > > @@ -4366,7 +4366,11 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
> > >   {
> > >   	bool async;
> > > -	if (fault->is_private)
> > > +	if (fault->slot && fault->is_private && !kvm_slot_can_be_private(fault->slot) &&
> > > +	    (vcpu->kvm->arch.vm_type == KVM_X86_SNP_VM))
> > > +		pr_warn("%s: private SEV TIO MMIO fault for fault->gfn=%llx\n",
> > > +			__func__, fault->gfn);
> > > +	else if (fault->is_private)
> > >   		return kvm_faultin_pfn_private(vcpu, fault);
> > >   	async = false;
> > > -- 
> > > 2.45.2
> > > 
> > > 
> 
> -- 
> Alexey
>
Alexey Kardashevskiy Sept. 6, 2024, 3:31 a.m. UTC | #4
On 3/9/24 15:13, Xu Yilun wrote:
> On Mon, Sep 02, 2024 at 12:22:56PM +1000, Alexey Kardashevskiy wrote:
>>
>>
>> On 31/8/24 02:57, Xu Yilun wrote:
>>> On Fri, Aug 23, 2024 at 11:21:27PM +1000, Alexey Kardashevskiy wrote:
>>>> Currently private MMIO nested page faults are not expected so when such
>>>> fault occurs, KVM tries moving the faulted page from private to shared
>>>> which is not going to work as private MMIO is not backed by memfd.
>>>>
>>>> Handle private MMIO as shared: skip page state change and memfd
>>>
>>> This means host keeps the mapping for private MMIO, which is different
>>> from private memory. Not sure if it is expected, and I want to get
>>> some directions here.
>>
>> There is no other translation table on AMD though, the same NPT. The
> 
> Sorry for not being clear, when I say "host mapping" I mean host
> userspace mapping (host CR3 mapping). By using guest_memfd, there is no
> host CR3 mapping for private memory. I'm wondering if we could keep host
> CR3 mapping for private MMIO.
> >> security is enforced by the RMP table. A device says "bar#x is 
private" so
>> the host + firmware ensure the each corresponding RMP entry is "assigned" +
>> "validated" and has a correct IDE stream ID and ASID, and the VM's kernel
>> maps it with the Cbit set.
>>
>>>   From HW perspective, private MMIO is not intended to be accessed by
>>> host, but the consequence may varies. According to TDISP spec 11.2,
>>> my understanding is private device (known as TDI) should reject the
>>> TLP and transition to TDISP ERROR state. But no further error
>>> reporting or logging is mandated. So the impact to the host system
>>> is specific to each device. In my test environment, an AER
>>> NonFatalErr is reported and nothing more, much better than host
>>> accessing private memory.
>>
>> afair I get an non-fatal RMP fault so the device does not even notice.
>>
>>> On SW side, my concern is how to deal with mmu_notifier. In theory, if
>>> we get pfn from hva we should follow the userspace mapping change. But
>>> that makes no sense. Especially for TDX TEE-IO, private MMIO mapping
>>> in SEPT cannot be changed or invalidated as long as TDI is running.
>>
>>> Another concern may be specific for TDX TEE-IO. Allowing both userspace
>>> mapping and SEPT mapping may be safe for private MMIO, but on
>>> KVM_SET_USER_MEMORY_REGION2,  KVM cannot actually tell if a userspace
>>> addr is really for private MMIO. I.e. user could provide shared memory
>>> addr to KVM but declare it is for private MMIO. The shared memory then
>>> could be mapped in SEPT and cause problem.
>>
>> I am missing lots of context here. When you are starting a guest with a
>> passed through device, until the TDISP machinery transitions the TDI into
>> RUN, this TDI's MMIO is shared and mapped everywhere. And after
> 
> Yes, that's the situation nowadays. I think if we need to eliminate
> host CR3 mapping for private MMIO, a simple way is we don't allow host
> CR3 mapping at the first place, even for shared pass through. It is
> doable cause:
> 
>   1. IIUC, host CR3 mapping for assigned MMIO is only used for pfn
>      finding, i.e. host doesn't really (or shouldn't?) access them.

Well, the host userspace might also want to access MMIO via mmap'ed 
region if it is, say, DPDK.

>   2. The hint from guest_memfd shows KVM doesn't have to rely on host
>      CR3 mapping to find pfn.

True.

>> transitioning to RUN you move mappings from EPT to SEPT?
> 
> Mostly correct, TDX move mapping from EPT to SEPT after LOCKED and
> right before RUN.
> 
>>
>>> So personally I prefer no host mapping for private MMIO.
>>
>> Nah, cannot skip this step on AMD. Thanks,
> 
> Not sure if we are on the same page.

With the above explanation, we are.

> I assume from HW perspective, host
> CR3 mapping is not necessary for NPT/RMP build?

Yeah, the hw does not require that afaik. But the existing code 
continues working for AMD, and I am guessing it is still true for your 
case too, right? Unless the host userspace tries accessing the private 
MMIO and some horrible stuff happens? Thanks,


> Thanks,
> Yilun
> 
>>
>>
>>>
>>> Thanks,
>>> Yilun
>>>
>>>> page state tracking.
>>>>
>>>> The MMIO KVM memory slot is still marked as shared as the guest can
>>>> access it as private or shared so marking the MMIO slot as private
>>>> is not going to help.
>>>>
>>>> Signed-off-by: Alexey Kardashevskiy <aik@amd.com>
>>>> ---
>>>>    arch/x86/kvm/mmu/mmu.c | 6 +++++-
>>>>    1 file changed, 5 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
>>>> index 928cf84778b0..e74f5c3d0821 100644
>>>> --- a/arch/x86/kvm/mmu/mmu.c
>>>> +++ b/arch/x86/kvm/mmu/mmu.c
>>>> @@ -4366,7 +4366,11 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
>>>>    {
>>>>    	bool async;
>>>> -	if (fault->is_private)
>>>> +	if (fault->slot && fault->is_private && !kvm_slot_can_be_private(fault->slot) &&
>>>> +	    (vcpu->kvm->arch.vm_type == KVM_X86_SNP_VM))
>>>> +		pr_warn("%s: private SEV TIO MMIO fault for fault->gfn=%llx\n",
>>>> +			__func__, fault->gfn);
>>>> +	else if (fault->is_private)
>>>>    		return kvm_faultin_pfn_private(vcpu, fault);
>>>>    	async = false;
>>>> -- 
>>>> 2.45.2
>>>>
>>>>
>>
>> -- 
>> Alexey
>>
Xu Yilun Sept. 9, 2024, 10:07 a.m. UTC | #5
On Fri, Sep 06, 2024 at 01:31:48PM +1000, Alexey Kardashevskiy wrote:
> 
> 
> On 3/9/24 15:13, Xu Yilun wrote:
> > On Mon, Sep 02, 2024 at 12:22:56PM +1000, Alexey Kardashevskiy wrote:
> > > 
> > > 
> > > On 31/8/24 02:57, Xu Yilun wrote:
> > > > On Fri, Aug 23, 2024 at 11:21:27PM +1000, Alexey Kardashevskiy wrote:
> > > > > Currently private MMIO nested page faults are not expected so when such
> > > > > fault occurs, KVM tries moving the faulted page from private to shared
> > > > > which is not going to work as private MMIO is not backed by memfd.
> > > > > 
> > > > > Handle private MMIO as shared: skip page state change and memfd
> > > > 
> > > > This means host keeps the mapping for private MMIO, which is different
> > > > from private memory. Not sure if it is expected, and I want to get
> > > > some directions here.
> > > 
> > > There is no other translation table on AMD though, the same NPT. The
> > 
> > Sorry for not being clear, when I say "host mapping" I mean host
> > userspace mapping (host CR3 mapping). By using guest_memfd, there is no
> > host CR3 mapping for private memory. I'm wondering if we could keep host
> > CR3 mapping for private MMIO.
> > >> security is enforced by the RMP table. A device says "bar#x is
> private" so
> > > the host + firmware ensure the each corresponding RMP entry is "assigned" +
> > > "validated" and has a correct IDE stream ID and ASID, and the VM's kernel
> > > maps it with the Cbit set.
> > > 
> > > >   From HW perspective, private MMIO is not intended to be accessed by
> > > > host, but the consequence may varies. According to TDISP spec 11.2,
> > > > my understanding is private device (known as TDI) should reject the
> > > > TLP and transition to TDISP ERROR state. But no further error
> > > > reporting or logging is mandated. So the impact to the host system
> > > > is specific to each device. In my test environment, an AER
> > > > NonFatalErr is reported and nothing more, much better than host
> > > > accessing private memory.
> > > 
> > > afair I get an non-fatal RMP fault so the device does not even notice.
> > > 
> > > > On SW side, my concern is how to deal with mmu_notifier. In theory, if
> > > > we get pfn from hva we should follow the userspace mapping change. But
> > > > that makes no sense. Especially for TDX TEE-IO, private MMIO mapping
> > > > in SEPT cannot be changed or invalidated as long as TDI is running.
> > > 
> > > > Another concern may be specific for TDX TEE-IO. Allowing both userspace
> > > > mapping and SEPT mapping may be safe for private MMIO, but on
> > > > KVM_SET_USER_MEMORY_REGION2,  KVM cannot actually tell if a userspace
> > > > addr is really for private MMIO. I.e. user could provide shared memory
> > > > addr to KVM but declare it is for private MMIO. The shared memory then
> > > > could be mapped in SEPT and cause problem.
> > > 
> > > I am missing lots of context here. When you are starting a guest with a
> > > passed through device, until the TDISP machinery transitions the TDI into
> > > RUN, this TDI's MMIO is shared and mapped everywhere. And after
> > 
> > Yes, that's the situation nowadays. I think if we need to eliminate
> > host CR3 mapping for private MMIO, a simple way is we don't allow host
> > CR3 mapping at the first place, even for shared pass through. It is
> > doable cause:
> > 
> >   1. IIUC, host CR3 mapping for assigned MMIO is only used for pfn
> >      finding, i.e. host doesn't really (or shouldn't?) access them.
> 
> Well, the host userspace might also want to access MMIO via mmap'ed region
> if it is, say, DPDK.

Yes for DPDK. But I mean for virtualization cases, host doesn't access
assigned MMIO.

I'm not suggesting we remove the entire mmap functionality in VFIO, but
may have a user-optional no-mmap mode for private capable device.

> 
> >   2. The hint from guest_memfd shows KVM doesn't have to rely on host
> >      CR3 mapping to find pfn.
> 
> True.
> 
> > > transitioning to RUN you move mappings from EPT to SEPT?
> > 
> > Mostly correct, TDX move mapping from EPT to SEPT after LOCKED and
> > right before RUN.
> > 
> > > 
> > > > So personally I prefer no host mapping for private MMIO.
> > > 
> > > Nah, cannot skip this step on AMD. Thanks,
> > 
> > Not sure if we are on the same page.
> 
> With the above explanation, we are.
> 
> > I assume from HW perspective, host
> > CR3 mapping is not necessary for NPT/RMP build?
> 
> Yeah, the hw does not require that afaik. But the existing code continues
> working for AMD, and I am guessing it is still true for your case too,

It works for TDX with some minor changes similar as this patch does. But
still see some concerns on my side, E.g. mmu_notifier. Unlike SEV-SNP,
TDX firmware controls private MMIO accessing by building private S2 page
table. If I still follow the HVA based page fault routine, then I should
also follow the mmu_notifier, i.e. change private S2 mapping when HVA
mapping changes. But private MMIO accessing is part of the private dev
configuration and enforced (by firmware) not to be changed when TDI is
RUNning. My effort for this issue is that, don't use HVA based page
fault routine, switch to do like guest_memfd does.

I see SEV-SNP prebuilds RMP to control private MMIO accessing, S2 page
table modification is allowed at anytime. mmu_notifier only makes
private access dis-functional. I assume that could also be nice to
avoid.

> right? Unless the host userspace tries accessing the private MMIO and some
> horrible stuff happens? Thanks,

The common part for all vendors is, the private device will be
disturbed and enter TDISP ERROR state. I'm not sure if this is OK or can
also be nice to avoid.

Thanks,
Yilun
Alexey Kardashevskiy Sept. 10, 2024, 1:28 a.m. UTC | #6
On 9/9/24 20:07, Xu Yilun wrote:
> On Fri, Sep 06, 2024 at 01:31:48PM +1000, Alexey Kardashevskiy wrote:
>>
>>
>> On 3/9/24 15:13, Xu Yilun wrote:
>>> On Mon, Sep 02, 2024 at 12:22:56PM +1000, Alexey Kardashevskiy wrote:
>>>>
>>>>
>>>> On 31/8/24 02:57, Xu Yilun wrote:
>>>>> On Fri, Aug 23, 2024 at 11:21:27PM +1000, Alexey Kardashevskiy wrote:
>>>>>> Currently private MMIO nested page faults are not expected so when such
>>>>>> fault occurs, KVM tries moving the faulted page from private to shared
>>>>>> which is not going to work as private MMIO is not backed by memfd.
>>>>>>
>>>>>> Handle private MMIO as shared: skip page state change and memfd
>>>>>
>>>>> This means host keeps the mapping for private MMIO, which is different
>>>>> from private memory. Not sure if it is expected, and I want to get
>>>>> some directions here.
>>>>
>>>> There is no other translation table on AMD though, the same NPT. The
>>>
>>> Sorry for not being clear, when I say "host mapping" I mean host
>>> userspace mapping (host CR3 mapping). By using guest_memfd, there is no
>>> host CR3 mapping for private memory. I'm wondering if we could keep host
>>> CR3 mapping for private MMIO.
>>>>> security is enforced by the RMP table. A device says "bar#x is
>> private" so
>>>> the host + firmware ensure the each corresponding RMP entry is "assigned" +
>>>> "validated" and has a correct IDE stream ID and ASID, and the VM's kernel
>>>> maps it with the Cbit set.
>>>>
>>>>>    From HW perspective, private MMIO is not intended to be accessed by
>>>>> host, but the consequence may varies. According to TDISP spec 11.2,
>>>>> my understanding is private device (known as TDI) should reject the
>>>>> TLP and transition to TDISP ERROR state. But no further error
>>>>> reporting or logging is mandated. So the impact to the host system
>>>>> is specific to each device. In my test environment, an AER
>>>>> NonFatalErr is reported and nothing more, much better than host
>>>>> accessing private memory.
>>>>
>>>> afair I get an non-fatal RMP fault so the device does not even notice.
>>>>
>>>>> On SW side, my concern is how to deal with mmu_notifier. In theory, if
>>>>> we get pfn from hva we should follow the userspace mapping change. But
>>>>> that makes no sense. Especially for TDX TEE-IO, private MMIO mapping
>>>>> in SEPT cannot be changed or invalidated as long as TDI is running.
>>>>
>>>>> Another concern may be specific for TDX TEE-IO. Allowing both userspace
>>>>> mapping and SEPT mapping may be safe for private MMIO, but on
>>>>> KVM_SET_USER_MEMORY_REGION2,  KVM cannot actually tell if a userspace
>>>>> addr is really for private MMIO. I.e. user could provide shared memory
>>>>> addr to KVM but declare it is for private MMIO. The shared memory then
>>>>> could be mapped in SEPT and cause problem.
>>>>
>>>> I am missing lots of context here. When you are starting a guest with a
>>>> passed through device, until the TDISP machinery transitions the TDI into
>>>> RUN, this TDI's MMIO is shared and mapped everywhere. And after
>>>
>>> Yes, that's the situation nowadays. I think if we need to eliminate
>>> host CR3 mapping for private MMIO, a simple way is we don't allow host
>>> CR3 mapping at the first place, even for shared pass through. It is
>>> doable cause:
>>>
>>>    1. IIUC, host CR3 mapping for assigned MMIO is only used for pfn
>>>       finding, i.e. host doesn't really (or shouldn't?) access them.
>>
>> Well, the host userspace might also want to access MMIO via mmap'ed region
>> if it is, say, DPDK.
> 
> Yes for DPDK. But I mean for virtualization cases, host doesn't access
> assigned MMIO.
> 
> I'm not suggesting we remove the entire mmap functionality in VFIO, but
> may have a user-optional no-mmap mode for private capable device.
 >
>>
>>>    2. The hint from guest_memfd shows KVM doesn't have to rely on host
>>>       CR3 mapping to find pfn.
>>
>> True.
>>
>>>> transitioning to RUN you move mappings from EPT to SEPT?
>>>
>>> Mostly correct, TDX move mapping from EPT to SEPT after LOCKED and
>>> right before RUN.
>>>
>>>>
>>>>> So personally I prefer no host mapping for private MMIO.
>>>>
>>>> Nah, cannot skip this step on AMD. Thanks,
>>>
>>> Not sure if we are on the same page.
>>
>> With the above explanation, we are.
>>
>>> I assume from HW perspective, host
>>> CR3 mapping is not necessary for NPT/RMP build?
>>
>> Yeah, the hw does not require that afaik. But the existing code continues
>> working for AMD, and I am guessing it is still true for your case too,
> 
> It works for TDX with some minor changes similar as this patch does. But
> still see some concerns on my side, E.g. mmu_notifier. Unlike SEV-SNP,
> TDX firmware controls private MMIO accessing by building private S2 page
> table. If I still follow the HVA based page fault routine, then I should
> also follow the mmu_notifier, i.e. change private S2 mapping when HVA
> mapping changes. But private MMIO accessing is part of the private dev
> configuration and enforced (by firmware) not to be changed when TDI is
> RUNning. My effort for this issue is that, don't use HVA based page
> fault routine, switch to do like guest_memfd does.

ah I see, thanks.

> I see SEV-SNP prebuilds RMP to control private MMIO accessing, S2 page
> table modification is allowed at anytime. mmu_notifier only makes
> private access dis-functional. I assume that could also be nice to
> avoid.
> 
>> right? Unless the host userspace tries accessing the private MMIO and some
>> horrible stuff happens? Thanks,
> 
> The common part for all vendors is, the private device will be
> disturbed and enter TDISP ERROR state. I'm not sure if this is OK or can
> also be nice to avoid.

For this instance, on AMD, I expect an RMP fault and no device 
disturbance, no TDISP ERROR. Thanks,


> 
> Thanks,
> Yilun
diff mbox series

Patch

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 928cf84778b0..e74f5c3d0821 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4366,7 +4366,11 @@  static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
 {
 	bool async;
 
-	if (fault->is_private)
+	if (fault->slot && fault->is_private && !kvm_slot_can_be_private(fault->slot) &&
+	    (vcpu->kvm->arch.vm_type == KVM_X86_SNP_VM))
+		pr_warn("%s: private SEV TIO MMIO fault for fault->gfn=%llx\n",
+			__func__, fault->gfn);
+	else if (fault->is_private)
 		return kvm_faultin_pfn_private(vcpu, fault);
 
 	async = false;