
[Part2,RFC,v2,10/37] x86/fault: Add support to handle the RMP fault for kernel address

Message ID 20210430123822.13825-11-brijesh.singh@amd.com (mailing list archive)
State New, archived
Series Add AMD Secure Nested Paging (SEV-SNP) Hypervisor Support

Commit Message

Brijesh Singh April 30, 2021, 12:37 p.m. UTC
When SEV-SNP is enabled globally, a write from the host goes through the
RMP check. When the host writes to pages, hardware checks the following
conditions at the end of page walk:

1. Assigned bit in the RMP table is zero (i.e., the page is shared).
2. If the page table entry that gives the sPA indicates that the target
   page size is a large page, then all RMP entries for the 4KB
   constituent pages of the target must have the assigned bit 0.
3. Immutable bit in the RMP table is not zero.

The hardware will raise a page fault if one of the above conditions is
not met. A host should not encounter an RMP fault in normal execution,
but a malicious guest could trick the hypervisor into one. E.g., if a
guest does not make its GHCB page shared, then on #VMGEXIT the
hypervisor will attempt to write to the private GHCB page.

Try resolving the fault instead of crashing the host. To resolve it,
forcefully clear the assigned bit from the RMP entry to make the page
shared so that the write succeeds.  If the fault handler cannot resolve
the RMP violation, then dump the RMP entry for debugging purposes.

Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
---
 arch/x86/mm/fault.c | 146 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 146 insertions(+)
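The recovery described above amounts to: on an RMP fault for a kernel address, clear the assigned bit (unless the entry is immutable) and retry the access. A minimal user-space model of that logic, with an invented RMP entry layout and helper name rather than the patch's actual code, might look like:

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Toy user-space model of the recovery path described above.  The real
 * patch manipulates the hardware RMP table with the RMPUPDATE
 * instruction from arch/x86/mm/fault.c; the entry layout and helper
 * name here are purely illustrative.
 */
struct rmpentry {
	bool assigned;   /* set: page is guest-private */
	bool immutable;  /* set: entry cannot be changed */
};

/* Toy RMP covering 16 4KB pages; entry 4 models a firmware-owned page. */
static struct rmpentry rmp_table[16] = {
	[3] = { .assigned = true },
	[4] = { .assigned = true, .immutable = true },
};

/*
 * Try to resolve an RMP fault on 'pfn' by forcing the page back to the
 * shared state so the host write can succeed.  Returns true when the
 * fault was resolved; false when it cannot be (the real code would then
 * dump the RMP entry for debugging).
 */
static bool rmp_fault_resolve(unsigned int pfn)
{
	struct rmpentry *e = &rmp_table[pfn];

	if (e->immutable)
		return false;

	e->assigned = false;	/* models an RMPUPDATE to "shared" */
	return true;
}
```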

Comments

Dave Hansen May 3, 2021, 2:44 p.m. UTC | #1
On 4/30/21 5:37 AM, Brijesh Singh wrote:
> When SEV-SNP is enabled globally, a write from the host goes through the
> RMP check. When the host writes to pages, hardware checks the following
> conditions at the end of page walk:
> 
> 1. Assigned bit in the RMP table is zero (i.e page is shared).
> 2. If the page table entry that gives the sPA indicates that the target
>    page size is a large page, then all RMP entries for the 4KB
>    constituting pages of the target must have the assigned bit 0.
> 3. Immutable bit in the RMP table is not zero.
> 
> The hardware will raise page fault if one of the above conditions is not
> met. A host should not encounter the RMP fault in normal execution, but
> a malicious guest could trick the hypervisor into it. e.g., a guest does
> not make the GHCB page shared, on #VMGEXIT, the hypervisor will attempt
> to write to GHCB page.

Is that the only case which is left?  If so, why don't you simply split
the direct map for GHCB pages before giving them to the guest?  Or, map
them with vmap() so that the mapping is always 4k?

Or, worst case, you could use exception tables and something like
copy_to_user() to write to the GHCB.  That way, the thread doing the
write can safely recover from the fault without the instruction actually
ever finishing execution.

BTW, I went looking through the spec.  I didn't see anything about the
guest being able to write the "Assigned" RMP bit.  Did I miss that?
Which of the above three conditions is triggered by the guest failing to
make the GHCB page shared?
Andy Lutomirski May 3, 2021, 3:03 p.m. UTC | #2
On Mon, May 3, 2021 at 7:44 AM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 4/30/21 5:37 AM, Brijesh Singh wrote:
> > When SEV-SNP is enabled globally, a write from the host goes through the
> > RMP check. When the host writes to pages, hardware checks the following
> > conditions at the end of page walk:
> >
> > 1. Assigned bit in the RMP table is zero (i.e page is shared).
> > 2. If the page table entry that gives the sPA indicates that the target
> >    page size is a large page, then all RMP entries for the 4KB
> >    constituting pages of the target must have the assigned bit 0.
> > 3. Immutable bit in the RMP table is not zero.
> >
> > The hardware will raise page fault if one of the above conditions is not
> > met. A host should not encounter the RMP fault in normal execution, but
> > a malicious guest could trick the hypervisor into it. e.g., a guest does
> > not make the GHCB page shared, on #VMGEXIT, the hypervisor will attempt
> > to write to GHCB page.
>
> Is that the only case which is left?  If so, why don't you simply split
> the direct map for GHCB pages before giving them to the guest?  Or, map
> them with vmap() so that the mapping is always 4k?

If I read Brijesh's message right, this isn't about 4k.  It's about
the guest violating host expectations about the page type.

I need to go and do a full read of all the relevant specs, but I think
there's an analogous situation in TDX: if the host touches guest
private memory, the TDX hardware will get extremely angry (more so
than AMD hardware).  And, if I have understood this patch correctly,
it's fudging around the underlying bug by intentionally screwing up
the RMP contents to avoid a page fault.  Assuming I've understood
everything correctly (a big if!), then I think this is backwards.  The
host kernel should not ever access guest memory without a plan in
place to handle failure.  We need real accessors, along the lines of
copy_from_guest() and copy_to_guest().
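A rough sketch of such accessors, where a toy page representation stands in for a real RMP lookup (only the names copy_to_guest()/copy_from_guest() come from the message above; everything else is assumed):

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <string.h>

/*
 * Sketch of the guarded accessors proposed above.  Only the names
 * copy_to_guest()/copy_from_guest() come from the discussion; the
 * page representation and the shared/private flag are stand-ins for
 * a real RMP lookup.
 */
struct guest_page {
	bool shared;			/* stand-in for !RMP.assigned */
	unsigned char data[4096];
};

static struct guest_page g_shared  = { .shared = true };
static struct guest_page g_private = { .shared = false };

static int copy_to_guest(struct guest_page *gp, const void *src, size_t len)
{
	if (!gp->shared)		/* never touch private memory */
		return -EFAULT;
	memcpy(gp->data, src, len);
	return 0;
}

static int copy_from_guest(void *dst, const struct guest_page *gp, size_t len)
{
	if (!gp->shared)
		return -EFAULT;
	memcpy(dst, gp->data, len);
	return 0;
}
```
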
Brijesh Singh May 3, 2021, 3:37 p.m. UTC | #3
Hi Dave,

On 5/3/21 9:44 AM, Dave Hansen wrote:
> On 4/30/21 5:37 AM, Brijesh Singh wrote:
>> When SEV-SNP is enabled globally, a write from the host goes through the
>> RMP check. When the host writes to pages, hardware checks the following
>> conditions at the end of page walk:
>>
>> 1. Assigned bit in the RMP table is zero (i.e page is shared).
>> 2. If the page table entry that gives the sPA indicates that the target
>>    page size is a large page, then all RMP entries for the 4KB
>>    constituting pages of the target must have the assigned bit 0.
>> 3. Immutable bit in the RMP table is not zero.
>>
>> The hardware will raise page fault if one of the above conditions is not
>> met. A host should not encounter the RMP fault in normal execution, but
>> a malicious guest could trick the hypervisor into it. e.g., a guest does
>> not make the GHCB page shared, on #VMGEXIT, the hypervisor will attempt
>> to write to GHCB page.
> Is that the only case which is left?  If so, why don't you simply split
> the direct map for GHCB pages before giving them to the guest?  Or, map
> them with vmap() so that the mapping is always 4k?

GHCB was just an example. Another example is a vfio driver accessing a
shared page. If those pages are not marked shared, then a kernel access
will cause an RMP fault. Ideally we should not run into this situation,
but if we do, then I am trying to see how best we can avoid crashing
the host.

Another reason for having this is to catch hypervisor bugs: during SNP
guest creation, KVM allocates a few backing pages and sets the assigned
bit for them (examples are the VMSA and the firmware context page). If
the hypervisor accidentally frees these pages without clearing the
assigned bit in the RMP table, then it will result in an RMP fault and
thus a kernel crash.


>
> Or, worst case, you could use exception tables and something like
> copy_to_user() to write to the GHCB.  That way, the thread doing the
> write can safely recover from the fault without the instruction actually
> ever finishing execution.
>
> BTW, I went looking through the spec.  I didn't see anything about the
> guest being able to write the "Assigned" RMP bit.  Did I miss that?
> Which of the above three conditions is triggered by the guest failing to
> make the GHCB page shared?

The GHCB spec section "Page State Change" provides an interface for the
guest to request a page state change. During boot, the guest uses the
Page State Change VMGEXIT to request that the hypervisor make the page
shared. The hypervisor uses the RMPUPDATE instruction to write the
"assigned" bit in the RMP table.

On VMGEXIT, the very first thing the vmgexit handler does is map the
GHCB page for access; later it uses copy_to_user() to sync the GHCB
updates from the hypervisor to the guest. The copy_to_user() will cause
an RMP fault if the GHCB is not mapped shared. As I explained above,
the GHCB page was just an example; vfio or others may also get into
this situation.


-Brijesh
Brijesh Singh May 3, 2021, 3:49 p.m. UTC | #4
On 5/3/21 10:03 AM, Andy Lutomirski wrote:
> On Mon, May 3, 2021 at 7:44 AM Dave Hansen <dave.hansen@intel.com> wrote:
>> On 4/30/21 5:37 AM, Brijesh Singh wrote:
>>> When SEV-SNP is enabled globally, a write from the host goes through the
>>> RMP check. When the host writes to pages, hardware checks the following
>>> conditions at the end of page walk:
>>>
>>> 1. Assigned bit in the RMP table is zero (i.e page is shared).
>>> 2. If the page table entry that gives the sPA indicates that the target
>>>    page size is a large page, then all RMP entries for the 4KB
>>>    constituting pages of the target must have the assigned bit 0.
>>> 3. Immutable bit in the RMP table is not zero.
>>>
>>> The hardware will raise page fault if one of the above conditions is not
>>> met. A host should not encounter the RMP fault in normal execution, but
>>> a malicious guest could trick the hypervisor into it. e.g., a guest does
>>> not make the GHCB page shared, on #VMGEXIT, the hypervisor will attempt
>>> to write to GHCB page.
>> Is that the only case which is left?  If so, why don't you simply split
>> the direct map for GHCB pages before giving them to the guest?  Or, map
>> them with vmap() so that the mapping is always 4k?
> If I read Brijesh's message right, this isn't about 4k.  It's about
> the guest violating host expectations about the page type.
>
> I need to go and do a full read of all the relevant specs, but I think
> there's an analogous situation in TDX: if the host touches guest
> private memory, the TDX hardware will get extremely angry (more so
> than AMD hardware).  And, if I have understood this patch correctly,
> it's fudging around the underlying bug by intentionally screwing up
> the RMP contents to avoid a page fault.  Assuming I've understood
> everything correctly (a big if!), then I think this is backwards.  The
> host kernel should not ever access guest memory without a plan in
> place to handle failure.  We need real accessors, along the lines of
> copy_from_guest() and copy_to_guest().

You understood it correctly. It is an underlying bug, in either the
host or the guest, that may cause the host to access guest private
pages. If that happens, avoiding the host crash is much preferred
(especially when it is a guest kernel bug).
Dave Hansen May 3, 2021, 4:15 p.m. UTC | #5
On 5/3/21 8:37 AM, Brijesh Singh wrote:
> GHCB was just an example. Another example is a vfio driver accessing the
> shared page. If those pages are not marked shared then kernel access
> will cause an RMP fault. Ideally we should not be running into this
> situation, but if we do, then I am trying to see how best we can avoid
> the host crashes.

I'm confused.  Are you suggesting that the VFIO driver could be passed
an address such that the host kernel would blindly try to write private
guest memory?

The host kernel *knows* which memory is guest private and what is
shared.  It had to set it up in the first place.  It can also consult
the RMP at any time if it somehow forgot.

So, this scenario seems to be that the host got a guest physical address
(gpa) from the guest, it did a gpa->hpa->hva conversion and then wrote
the page, all without bothering to consult the RMP.  Shouldn't the
gpa->hpa conversion point offer a perfect place to determine if the page
is shared or private?
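A sketch of that idea, with an invented identity gpa->hpa map and helper name, failing the translation itself when the RMP marks the page private so no later kernel write can reach it:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Sketch of the suggestion above: fail the gpa->hpa translation itself
 * when the RMP says the page is private.  The page size, table layout,
 * and helper name are invented for illustration.
 */
#define PAGE_SHIFT	12
#define NR_GUEST_PAGES	16

/* Toy per-guest state: identity gpa->hpa map plus a shared/private bit. */
static bool page_shared[NR_GUEST_PAGES] = {
	[0] = true, [1] = true,		/* pages the guest made shared */
};

/* Return the hpa for 'gpa', or -EFAULT when the page is guest-private. */
static int64_t gpa_to_hpa_checked(uint64_t gpa)
{
	uint64_t pfn = gpa >> PAGE_SHIFT;

	if (pfn >= NR_GUEST_PAGES || !page_shared[pfn])
		return -EFAULT;

	return (int64_t)gpa;		/* identity map in this model */
}
```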

> Another reason for having this is to catch  the hypervisor bug, during
> the SNP guest create, the KVM allocates few backing pages and sets the
> assigned bit for it (the examples are VMSA, and firmware context page).
> If hypervisor accidentally free's these pages without clearing the
> assigned bit in the RMP table then it will result in RMP fault and thus
> a kernel crash.

I think I'd be just fine with a BUG_ON() in those cases instead of an
attempt to paper over the issue.  Kernel crashes are fine in the case of
kernel bugs.

>> Or, worst case, you could use exception tables and something like
>> copy_to_user() to write to the GHCB.  That way, the thread doing the
>> write can safely recover from the fault without the instruction actually
>> ever finishing execution.
>>
>> BTW, I went looking through the spec.  I didn't see anything about the
>> guest being able to write the "Assigned" RMP bit.  Did I miss that?
>> Which of the above three conditions is triggered by the guest failing to
>> make the GHCB page shared?
> 
> The GHCB spec section "Page State Change" provides an interface for the
> guest to request the page state change. During bootup, the guest uses
> the Page State Change VMGEXIT to request hypervisor to make the page
> shared. The hypervisor uses the RMPUPDATE instruction to write to
> "assigned" bit in the RMP table.

Right...  So the *HOST* is in control.  Why should the host ever be
surprised by a page transitioning from shared to private?

> On VMGEXIT, the very first thing which vmgexit handler does is to map
> the GHCB page for the access and then later using the copy_to_user() to
> sync the GHCB updates from hypervisor to guest. The copy_to_user() will
> cause a RMP fault if the GHCB is not mapped shared. As I explained
> above, GHCB page was just an example, vfio or other may also get into
> this situation.

Causing an RMP fault is fine.  The problem is shoving a whole bunch of
*recovery* code in the kernel when recovery isn't necessary.  Just look
for the -EFAULT from copy_to_user() and move on with life.
Brijesh Singh May 3, 2021, 5:19 p.m. UTC | #6
On 5/3/21 11:15 AM, Dave Hansen wrote:
> On 5/3/21 8:37 AM, Brijesh Singh wrote:
>> GHCB was just an example. Another example is a vfio driver accessing the
>> shared page. If those pages are not marked shared then kernel access
>> will cause an RMP fault. Ideally we should not be running into this
>> situation, but if we do, then I am trying to see how best we can avoid
>> the host crashes.
> I'm confused.  Are you suggesting that the VFIO driver could be passed
> an address such that the host kernel would blindly try to write private
> guest memory?

Not blindly. But a guest could trick the VMM (qemu) into asking the
host driver to access a GPA that is a guest private page (it is a
hypothetical case, so it is possible that I am missing something).
Let's see with an example:

- A guest provides a GPA to VMM to write to (e.g DMA operation).

- VMM translates the GPA->HVA and calls down to host kernel with the HVA.

- The host kernel may pin the HVA to get the PFN for it and then kmap()
it. A write to the mapped PFN will cause an RMP fault if the
guest-provided GPA was not marked shared in the RMP table. In an ideal
world, a guest should *never* do this, but what if it does?


> The host kernel *knows* which memory is guest private and what is
> shared.  It had to set it up in the first place.  It can also consult
> the RMP at any time if it somehow forgot.
>
> So, this scenario seems to be that the host got a guest physical address
> (gpa) from the guest, it did a gpa->hpa->hva conversion and then wrote
> the page all without bothering to consult the RMP.  Shouldn't the the
> gpa->hpa conversion point offer a perfect place to determine if the page
> is shared or private?

The GPA->HVA translation is typically done by the VMM, and HVA->HPA is
done by the host drivers. So, the only time we could verify is after
the HVA->HPA conversion. One of my patches provides a
snp_lookup_page_in_rmptable() helper that can be used to query the page
state in the RMP table. This means all the host backend drivers need to
be enlightened to read the RMP table before making a write access to a
guest-provided GPA. A good guest should *never* use a private page for
a DMA operation, and if it does, then the fault handler introduced in
this patch can avoid the host crash and eliminate the need to enlighten
the drivers to check permissions before the access.

I felt it is a good idea to have some kind of recovery, especially when
a malicious guest could lead us down this path.


>
>> Another reason for having this is to catch  the hypervisor bug, during
>> the SNP guest create, the KVM allocates few backing pages and sets the
>> assigned bit for it (the examples are VMSA, and firmware context page).
>> If hypervisor accidentally free's these pages without clearing the
>> assigned bit in the RMP table then it will result in RMP fault and thus
>> a kernel crash.
> I think I'd be just fine with a BUG_ON() in those cases instead of an
> attempt to paper over the issue.  Kernel crashes are fine in the case of
> kernel bugs.

Yes, fine with me.


>
>>> Or, worst case, you could use exception tables and something like
>>> copy_to_user() to write to the GHCB.  That way, the thread doing the
>>> write can safely recover from the fault without the instruction actually
>>> ever finishing execution.
>>>
>>> BTW, I went looking through the spec.  I didn't see anything about the
>>> guest being able to write the "Assigned" RMP bit.  Did I miss that?
>>> Which of the above three conditions is triggered by the guest failing to
>>> make the GHCB page shared?
>> The GHCB spec section "Page State Change" provides an interface for the
>> guest to request the page state change. During bootup, the guest uses
>> the Page State Change VMGEXIT to request hypervisor to make the page
>> shared. The hypervisor uses the RMPUPDATE instruction to write to
>> "assigned" bit in the RMP table.
> Right...  So the *HOST* is in control.  Why should the host ever be
> surprised by a page transitioning from shared to private?

I am trying to cover the malicious guest cases. A good guest should
follow the GHCB spec and change the page state before the access.

>
>> On VMGEXIT, the very first thing which vmgexit handler does is to map
>> the GHCB page for the access and then later using the copy_to_user() to
>> sync the GHCB updates from hypervisor to guest. The copy_to_user() will
>> cause a RMP fault if the GHCB is not mapped shared. As I explained
>> above, GHCB page was just an example, vfio or other may also get into
>> this situation.
> Causing an RMP fault is fine.  The problem is shoving a whole bunch of
> *recovery* code in the kernel when recovery isn't necessary.  Just look
> for the -EFAULT from copy_to_user() and move on with life.
Brijesh Singh May 3, 2021, 5:31 p.m. UTC | #7
On 5/3/21 12:19 PM, Brijesh Singh wrote:
> On 5/3/21 11:15 AM, Dave Hansen wrote:
>> On 5/3/21 8:37 AM, Brijesh Singh wrote:
>>> GHCB was just an example. Another example is a vfio driver accessing the
>>> shared page. If those pages are not marked shared then kernel access
>>> will cause an RMP fault. Ideally we should not be running into this
>>> situation, but if we do, then I am trying to see how best we can avoid
>>> the host crashes.
>> I'm confused.  Are you suggesting that the VFIO driver could be passed

One small correction: I meant to say VIRTIO but typed VFIO. Sorry for
the confusion.


>> an address such that the host kernel would blindly try to write private
>> guest memory?
> Not blindly. But a guest could trick a VMM (qemu) to ask the host driver
> to access a GPA which is guest private page (Its a hypothetical case, so
> its possible that I may missing something). Let's see with an example:
>
> - A guest provides a GPA to VMM to write to (e.g DMA operation).
>
> - VMM translates the GPA->HVA and calls down to host kernel with the HVA.
>
> - The host kernel may pin the HVA to get the PFN for it and then kmap().
> Write to the mapped PFN will cause an RMP fault if the guest provided
> GPA was not a marked shared in the RMP table. In an ideal world, a guest
> should *never* do this but what if it does ?
>
>
>> The host kernel *knows* which memory is guest private and what is
>> shared.  It had to set it up in the first place.  It can also consult
>> the RMP at any time if it somehow forgot.
>>
>> So, this scenario seems to be that the host got a guest physical address
>> (gpa) from the guest, it did a gpa->hpa->hva conversion and then wrote
>> the page all without bothering to consult the RMP.  Shouldn't the the
>> gpa->hpa conversion point offer a perfect place to determine if the page
>> is shared or private?
> The GPA->HVA is typically done by the VMM, and HVA->HPA is done by the
> host drivers. So, only time we could verify is after the HVA->HPA. One
> of my patch provides a snp_lookup_page_in_rmptable() helper that can be
> used to query the page state in the RMP table. This means the all the
> host backend drivers need to enlightened to always read the RMP table
> before making a write access to guest provided GPA. A good guest should
> *never* be using a private page for the DMA operation and if it does
> then the fault handler introduced in this patch can avoid the host crash
> and eliminate the need to enlightened the drivers to check for the
> permission before the access.
>
> I felt it is good idea to have some kind of recovery specially when a
> malicious guest could lead us into this path.
>
>
>>> Another reason for having this is to catch  the hypervisor bug, during
>>> the SNP guest create, the KVM allocates few backing pages and sets the
>>> assigned bit for it (the examples are VMSA, and firmware context page).
>>> If hypervisor accidentally free's these pages without clearing the
>>> assigned bit in the RMP table then it will result in RMP fault and thus
>>> a kernel crash.
>> I think I'd be just fine with a BUG_ON() in those cases instead of an
>> attempt to paper over the issue.  Kernel crashes are fine in the case of
>> kernel bugs.
> Yes, fine with me.
>
>
>>>> Or, worst case, you could use exception tables and something like
>>>> copy_to_user() to write to the GHCB.  That way, the thread doing the
>>>> write can safely recover from the fault without the instruction actually
>>>> ever finishing execution.
>>>>
>>>> BTW, I went looking through the spec.  I didn't see anything about the
>>>> guest being able to write the "Assigned" RMP bit.  Did I miss that?
>>>> Which of the above three conditions is triggered by the guest failing to
>>>> make the GHCB page shared?
>>> The GHCB spec section "Page State Change" provides an interface for the
>>> guest to request the page state change. During bootup, the guest uses
>>> the Page State Change VMGEXIT to request hypervisor to make the page
>>> shared. The hypervisor uses the RMPUPDATE instruction to write to
>>> "assigned" bit in the RMP table.
>> Right...  So the *HOST* is in control.  Why should the host ever be
>> surprised by a page transitioning from shared to private?
> I am trying is a cover a malicious guest cases. A good guest should
> follow the GHCB spec and change the page state before the access.
>
>>> On VMGEXIT, the very first thing which vmgexit handler does is to map
>>> the GHCB page for the access and then later using the copy_to_user() to
>>> sync the GHCB updates from hypervisor to guest. The copy_to_user() will
>>> cause a RMP fault if the GHCB is not mapped shared. As I explained
>>> above, GHCB page was just an example, vfio or other may also get into
>>> this situation.
>> Causing an RMP fault is fine.  The problem is shoving a whole bunch of
>> *recovery* code in the kernel when recovery isn't necessary.  Just look
>> for the -EFAULT from copy_to_user() and move on with life.
Andy Lutomirski May 3, 2021, 5:40 p.m. UTC | #8
> On May 3, 2021, at 10:19 AM, Brijesh Singh <brijesh.singh@amd.com> wrote:
> 
> 
>> On 5/3/21 11:15 AM, Dave Hansen wrote:
>>> On 5/3/21 8:37 AM, Brijesh Singh wrote:
>>> GHCB was just an example. Another example is a vfio driver accessing the
>>> shared page. If those pages are not marked shared then kernel access
>>> will cause an RMP fault. Ideally we should not be running into this
>>> situation, but if we do, then I am trying to see how best we can avoid
>>> the host crashes.
>> I'm confused.  Are you suggesting that the VFIO driver could be passed
>> an address such that the host kernel would blindly try to write private
>> guest memory?
> 
> Not blindly. But a guest could trick a VMM (qemu) to ask the host driver
> to access a GPA which is guest private page (Its a hypothetical case, so
> its possible that I may missing something). Let's see with an example:
> 
> - A guest provides a GPA to VMM to write to (e.g DMA operation).
> 
> - VMM translates the GPA->HVA and calls down to host kernel with the HVA.
> 
> - The host kernel may pin the HVA to get the PFN for it and then kmap().
> Write to the mapped PFN will cause an RMP fault if the guest provided
> GPA was not a marked shared in the RMP table. In an ideal world, a guest
> should *never* do this but what if it does ?
> 
> 
>> The host kernel *knows* which memory is guest private and what is
>> shared.  It had to set it up in the first place.  It can also consult
>> the RMP at any time if it somehow forgot.
>> 
>> So, this scenario seems to be that the host got a guest physical address
>> (gpa) from the guest, it did a gpa->hpa->hva conversion and then wrote
>> the page all without bothering to consult the RMP.  Shouldn't the the
>> gpa->hpa conversion point offer a perfect place to determine if the page
>> is shared or private?
> 
> The GPA->HVA is typically done by the VMM, and HVA->HPA is done by the
> host drivers. So, only time we could verify is after the HVA->HPA. One
> of my patch provides a snp_lookup_page_in_rmptable() helper that can be
> used to query the page state in the RMP table. This means the all the
> host backend drivers need to enlightened to always read the RMP table
> before making a write access to guest provided GPA. A good guest should
> *never* be using a private page for the DMA operation and if it does
> then the fault handler introduced in this patch can avoid the host crash
> and eliminate the need to enlightened the drivers to check for the
> permission before the access.

Can we arrange for the page walk plus kmap process to fail?

> 
> I felt it is good idea to have some kind of recovery specially when a
> malicious guest could lead us into this path.
> 
> 
>> 
>>> Another reason for having this is to catch  the hypervisor bug, during
>>> the SNP guest create, the KVM allocates few backing pages and sets the
>>> assigned bit for it (the examples are VMSA, and firmware context page).
>>> If hypervisor accidentally free's these pages without clearing the
>>> assigned bit in the RMP table then it will result in RMP fault and thus
>>> a kernel crash.
>> I think I'd be just fine with a BUG_ON() in those cases instead of an
>> attempt to paper over the issue.  Kernel crashes are fine in the case of
>> kernel bugs.
> 
> Yes, fine with me.
> 
> 
>> 
>>>> Or, worst case, you could use exception tables and something like
>>>> copy_to_user() to write to the GHCB.  That way, the thread doing the
>>>> write can safely recover from the fault without the instruction actually
>>>> ever finishing execution.
>>>> 
>>>> BTW, I went looking through the spec.  I didn't see anything about the
>>>> guest being able to write the "Assigned" RMP bit.  Did I miss that?
>>>> Which of the above three conditions is triggered by the guest failing to
>>>> make the GHCB page shared?
>>> The GHCB spec section "Page State Change" provides an interface for the
>>> guest to request the page state change. During bootup, the guest uses
>>> the Page State Change VMGEXIT to request hypervisor to make the page
>>> shared. The hypervisor uses the RMPUPDATE instruction to write to
>>> "assigned" bit in the RMP table.
>> Right...  So the *HOST* is in control.  Why should the host ever be
>> surprised by a page transitioning from shared to private?
> 
> I am trying is a cover a malicious guest cases. A good guest should
> follow the GHCB spec and change the page state before the access.
> 
>> 
>>> On VMGEXIT, the very first thing which vmgexit handler does is to map
>>> the GHCB page for the access and then later using the copy_to_user() to
>>> sync the GHCB updates from hypervisor to guest. The copy_to_user() will
>>> cause a RMP fault if the GHCB is not mapped shared. As I explained
>>> above, GHCB page was just an example, vfio or other may also get into
>>> this situation.
>> Causing an RMP fault is fine.  The problem is shoving a whole bunch of
>> *recovery* code in the kernel when recovery isn't necessary.  Just look
>> for the -EFAULT from copy_to_user() and move on with life.
Brijesh Singh May 3, 2021, 7:41 p.m. UTC | #9
On 5/3/21 12:40 PM, Andy Lutomirski wrote:
>
>> On May 3, 2021, at 10:19 AM, Brijesh Singh <brijesh.singh@amd.com> wrote:
>>
>> 
>>> On 5/3/21 11:15 AM, Dave Hansen wrote:
>>>> On 5/3/21 8:37 AM, Brijesh Singh wrote:
>>>> GHCB was just an example. Another example is a vfio driver accessing the
>>>> shared page. If those pages are not marked shared then kernel access
>>>> will cause an RMP fault. Ideally we should not be running into this
>>>> situation, but if we do, then I am trying to see how best we can avoid
>>>> the host crashes.
>>> I'm confused.  Are you suggesting that the VFIO driver could be passed
>>> an address such that the host kernel would blindly try to write private
>>> guest memory?
>> Not blindly. But a guest could trick a VMM (qemu) to ask the host driver
>> to access a GPA which is guest private page (Its a hypothetical case, so
>> its possible that I may missing something). Let's see with an example:
>>
>> - A guest provides a GPA to VMM to write to (e.g DMA operation).
>>
>> - VMM translates the GPA->HVA and calls down to host kernel with the HVA.
>>
>> - The host kernel may pin the HVA to get the PFN for it and then kmap().
>> Write to the mapped PFN will cause an RMP fault if the guest provided
>> GPA was not a marked shared in the RMP table. In an ideal world, a guest
>> should *never* do this but what if it does ?
>>
>>
>>> The host kernel *knows* which memory is guest private and what is
>>> shared.  It had to set it up in the first place.  It can also consult
>>> the RMP at any time if it somehow forgot.
>>>
>>> So, this scenario seems to be that the host got a guest physical address
>>> (gpa) from the guest, it did a gpa->hpa->hva conversion and then wrote
>>> the page all without bothering to consult the RMP.  Shouldn't the the
>>> gpa->hpa conversion point offer a perfect place to determine if the page
>>> is shared or private?
>> The GPA->HVA is typically done by the VMM, and HVA->HPA is done by the
>> host drivers. So, only time we could verify is after the HVA->HPA. One
>> of my patch provides a snp_lookup_page_in_rmptable() helper that can be
>> used to query the page state in the RMP table. This means the all the
>> host backend drivers need to enlightened to always read the RMP table
>> before making a write access to guest provided GPA. A good guest should
>> *never* be using a private page for the DMA operation and if it does
>> then the fault handler introduced in this patch can avoid the host crash
>> and eliminate the need to enlightened the drivers to check for the
>> permission before the access.
> Can we arrange for the page walk plus kmap process to fail?

Sure, I will look into all the drivers which do a walk plus kmap to
make sure that they fail instead of going into the fault path. Should I
drop this patch or keep it just in case we miss something?
Dave Hansen May 3, 2021, 7:43 p.m. UTC | #10
On 5/3/21 12:41 PM, Brijesh Singh wrote:
> Sure, I will look into all the drivers which do a walk plus kmap to make
> sure that they fail instead of going into the fault path. Should I drop
> this patch or keep it just in the case we miss something?

I think you should drop it, and just ensure that the existing page fault
oops code can produce a coherent, descriptive error message about what
went wrong.
Brijesh Singh May 4, 2021, 12:31 p.m. UTC | #11
On 5/3/21 2:43 PM, Dave Hansen wrote:
> On 5/3/21 12:41 PM, Brijesh Singh wrote:
>> Sure, I will look into all the drivers which do a walk plus kmap to make
>> sure that they fail instead of going into the fault path. Should I drop
>> this patch or keep it just in case we miss something?
> I think you should drop it, and just ensure that the existing page fault
> oops code can produce a coherent, descriptive error message about what
> went wrong.

A malicious guest could still trick the host into accessing a guest
private page unless we make sure that the host kernel *never* does
kmap() on a GPA. The example I was thinking of is:

1. Guest provides a GPA to host.

2. Host queries the RMP table and finds that GPA is shared and allows
the kmap() to happen.

3. Guest later changes the page to private.

4. A host write to the mapped address will trigger a page fault.

KVM provides kvm_map_gfn() and kvm_vcpu_map() to map a GPA; these APIs
will no longer be safe to use. In addition, some shared pages are
registered once by the guest, and KVM updates the contents of the page
on vcpu enter (e.g., CPU steal time).

IMHO, we should add the RMP table check before kmap'ing a GPA, but
still keep this patch to mitigate the cases where a malicious guest
changes the page state after the kmap().

-Brijesh
Dave Hansen May 4, 2021, 2:33 p.m. UTC | #12
On 5/4/21 5:31 AM, Brijesh Singh wrote:
> On 5/3/21 2:43 PM, Dave Hansen wrote:
>> On 5/3/21 12:41 PM, Brijesh Singh wrote:
>>> Sure, I will look into all the drivers which do a walk plus kmap to make
>>> sure that they fail instead of going into the fault path. Should I drop
>>> this patch or keep it just in case we miss something?
>> I think you should drop it, and just ensure that the existing page fault
>> oops code can produce a coherent, descriptive error message about what
>> went wrong.
> 
> A malicious guest could still trick the host into accessing a guest
> private page unless we make sure that host kernel *never* does kmap() on
> GPA. The example I was thinking is:
> 
> 1. Guest provides a GPA to host.
> 
> 2. Host queries the RMP table and finds that GPA is shared and allows
> the kmap() to happen.
> 
> 3. Guest later changes the page to private.

This literally isn't possible in the SEV-SNP architecture.  I really
wish you would stop stating it.  It's horribly confusing.

The guest cannot directly change the page to private.  Only the host
can change the page to private.  The guest must _ask_ the host to do
it.  That's *CRITICALLY* important because what you need to do later is
prevent specific *HOST* behavior.

When those guest requests come in, the host has to ensure that the
request is refused or stalled until there is no chance that the host
will write to the page.  That means that the host needs some locks and
some metadata.

It's also why Andy has been suggesting that you need something along the
lines of copy_to/from_guest().  Those functions would take and release
locks to ensure that shared->private guest page transitions are
impossible while host access to the memory is in flight.

> 4. Host write to mapped address will trigger a page-fault.
> 
> KVM provides kvm_map_gfn(), kvm_vcpu_map() to map a GPA; these APIs will
> no longer be safe to be used.

Yes, it sounds like there is some missing KVM infrastructure that needs
to accompany this series.

> In addition, some shared pages are registered once by the guest and
> KVM updates the contents of the page on vcpu enter (e.g, CPU steal
> time).
Are you suggesting that the host would honor a guest request to convert
to private the shared page used for communicating CPU steal time?  That
seems like a bug to me.

> IMHO, we should add the RMP table check before kmap'ing GPA but still
> keep this patch to mitigate the cases where a malicious guest changes
> the page state after the kmap().

I much prefer a solution where guest requests are placed under
sufficient scrutiny and not blindly followed by the host.
Brijesh Singh May 4, 2021, 3:16 p.m. UTC | #13
Hi Dave,


On 5/4/21 9:33 AM, Dave Hansen wrote:
> On 5/4/21 5:31 AM, Brijesh Singh wrote:
>> On 5/3/21 2:43 PM, Dave Hansen wrote:
>>> On 5/3/21 12:41 PM, Brijesh Singh wrote:
>>>> Sure, I will look into all the drivers which do a walk plus kmap to make
>>>> sure that they fail instead of going into the fault path. Should I drop
>>>> this patch or keep it just in case we miss something?
>>> I think you should drop it, and just ensure that the existing page fault
>>> oops code can produce a coherent, descriptive error message about what
>>> went wrong.
>> A malicious guest could still trick the host into accessing a guest
>> private page unless we make sure that host kernel *never* does kmap() on
>> GPA. The example I was thinking is:
>>
>> 1. Guest provides a GPA to host.
>>
>> 2. Host queries the RMP table and finds that GPA is shared and allows
>> the kmap() to happen.
>>
>> 3. Guest later changes the page to private.
> This literally isn't possible in the SEV-SNP architecture.  I really
> wish you would stop stating it.  It's horribly confusing.
>
> The guest can not directly change the page to private.  Only the host
> can change the page to private.  The guest must _ask_ the host to do it.
>  That's *CRITICALLY* important because what you need to do later is
> prevent specific *HOST* behavior.
>
> When those guest requests come in, the host has to ensure that the
> request is refused or stalled until there is no chance that the host
> will write to the page.  That means that the host needs some locks and
> some metadata.

Ah, this message clarifies what you and Andy are asking. I was not able
to follow how the kmap'ed addresses will be protected, but now things
are much clearer and I feel better dropping this patch. Basically, we
want the host to keep track of the kmap'ed pages, and stall or reject a
guest request to change the page state if the page is already mapped by
the host.


> It's also why Andy has been suggesting that you need something along the
> lines of copy_to/from_guest().  Those functions would take and release
> locks to ensure that shared->private guest page transitions are
> impossible while host access to the memory is in flight.

Now it all makes sense.

>
>> 4. Host write to mapped address will trigger a page-fault.
>>
>> KVM provides kvm_map_gfn(), kvm_vcpu_map() to map a GPA; these APIs will
>> no longer be safe to be used.
> Yes, it sounds like there is some missing KVM infrastructure that needs
> to accompany this series.

Yes, I will enhance these APIs to ensure that map'ed GPAs are tracked.
diff mbox series

Patch

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 39d22f6870e1..d833fe84010f 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -19,6 +19,7 @@ 
 #include <linux/uaccess.h>		/* faulthandler_disabled()	*/
 #include <linux/efi.h>			/* efi_crash_gracefully_on_page_fault()*/
 #include <linux/mm_types.h>
+#include <linux/sev.h>			/* snp_lookup_page_in_rmptable() */
 
 #include <asm/cpufeature.h>		/* boot_cpu_has, ...		*/
 #include <asm/traps.h>			/* dotraplinkage, ...		*/
@@ -1132,6 +1133,145 @@  bool fault_in_kernel_space(unsigned long address)
 	return address >= TASK_SIZE_MAX;
 }
 
+#define RMP_FAULT_RETRY		0
+#define RMP_FAULT_KILL		1
+#define RMP_FAULT_PAGE_SPLIT	2
+
+static inline size_t pages_per_hpage(int level)
+{
+	return page_level_size(level) / PAGE_SIZE;
+}
+
+static void dump_rmpentry(unsigned long pfn)
+{
+	struct rmpentry *e;
+	int level;
+
+	e = snp_lookup_page_in_rmptable(pfn_to_page(pfn), &level);
+
+	/*
+	 * If the RMP entry at the faulting address was not assigned, then the
+	 * dump may not provide any useful debug information. Iterate through
+	 * the entire 2MB region, and dump the RMP entries if any bit in the
+	 * RMP entry is set.
+	 */
+	if (rmpentry_assigned(e)) {
+		pr_alert("RMPEntry paddr 0x%lx [assigned=%d immutable=%d pagesize=%d gpa=0x%lx"
+			" asid=%d vmsa=%d validated=%d]\n", pfn << PAGE_SHIFT,
+			rmpentry_assigned(e), rmpentry_immutable(e), rmpentry_pagesize(e),
+			rmpentry_gpa(e), rmpentry_asid(e), rmpentry_vmsa(e),
+			rmpentry_validated(e));
+
+		pr_alert("RMPEntry paddr 0x%lx %016llx %016llx\n", pfn << PAGE_SHIFT,
+			e->high, e->low);
+	} else {
+		unsigned long pfn_end;
+
+		pfn = pfn & ~0x1ff;
+		pfn_end = pfn + PTRS_PER_PMD;
+
+		while (pfn < pfn_end) {
+			e = snp_lookup_page_in_rmptable(pfn_to_page(pfn), &level);
+
+			if (unlikely(!e))
+				return;
+
+			if (e->low || e->high)
+				pr_alert("RMPEntry paddr 0x%lx: %016llx %016llx\n",
+					pfn << PAGE_SHIFT, e->high, e->low);
+			pfn++;
+		}
+	}
+}
+
+/*
+ * Called for all faults where 'address' is part of the kernel address space.
+ * The function returns RMP_FAULT_RETRY when it is able to resolve the fault
+ * and it is safe to retry.
+ */
+static int handle_kern_rmp_fault(unsigned long hw_error_code, unsigned long address)
+{
+	int ret, level, rmp_level, mask;
+	struct rmpupdate val = {};
+	struct rmpentry *e;
+	unsigned long pfn;
+	pgd_t *pgd;
+	pte_t *pte;
+
+	if (unlikely(!cpu_feature_enabled(X86_FEATURE_SEV_SNP)))
+		return RMP_FAULT_KILL;
+
+	pgd = __va(read_cr3_pa());
+	pgd += pgd_index(address);
+
+	pte = lookup_address_in_pgd(pgd, address, &level);
+
+	if (unlikely(!pte))
+		return RMP_FAULT_KILL;
+
+	switch (level) {
+	case PG_LEVEL_4K: pfn = pte_pfn(*pte); break;
+	case PG_LEVEL_2M: pfn = pmd_pfn(*(pmd_t *)pte); break;
+	case PG_LEVEL_1G: pfn = pud_pfn(*(pud_t *)pte); break;
+	case PG_LEVEL_512G: pfn = p4d_pfn(*(p4d_t *)pte); break;
+	default: return RMP_FAULT_KILL;
+	}
+
+	/* Calculate the PFN within large page. */
+	if (level > PG_LEVEL_4K) {
+		mask = pages_per_hpage(level) - pages_per_hpage(level - 1);
+		pfn |= (address >> PAGE_SHIFT) & mask;
+	}
+
+	e = snp_lookup_page_in_rmptable(pfn_to_page(pfn), &rmp_level);
+	if (unlikely(!e))
+		return RMP_FAULT_KILL;
+
+	/*
+	 * If the immutable bit is set, we cannot convert the page to shared
+	 * to resolve the fault.
+	 */
+	if (rmpentry_immutable(e))
+		goto e_dump_rmpentry;
+
+	/*
+	 * If the host page level is greater than the RMP page level, then the
+	 * only way to resolve the fault is to split the large page mapping. We
+	 * don't support splitting kernel mappings in the fault path yet.
+	 */
+	if (level > rmp_level)
+		goto e_dump_rmpentry;
+
+	/*
+	 * If the RMP page level is higher than the host page level, then use
+	 * PSMASH to split the large RMP entry into 512 4K entries.
+	 */
+	if (rmp_level > level) {
+		ret = psmash(pfn_to_page(pfn & ~0x1FF));
+		if (ret) {
+			pr_alert("Failed to psmash pfn 0x%lx (rc %d)\n", pfn, ret);
+			goto e_dump_rmpentry;
+		}
+	}
+
+	/* Log that the RMP fault handler is clearing the assigned bit. */
+	if (rmpentry_assigned(e))
+		pr_alert("Force address %lx from assigned -> unassigned in RMP table\n", address);
+
+	/* Clear the assigned bit from the RMP table. */
+	ret = rmpupdate(pfn_to_page(pfn), &val);
+	if (ret) {
+		pr_alert("Failed to unassign address 0x%lx in RMP table\n", address);
+		goto e_dump_rmpentry;
+	}
+
+	return RMP_FAULT_RETRY;
+
+e_dump_rmpentry:
+
+	dump_rmpentry(pfn);
+	return RMP_FAULT_KILL;
+}
+
 /*
  * Called for all faults where 'address' is part of the kernel address
  * space.  Might get called for faults that originate from *code* that
@@ -1179,6 +1319,12 @@  do_kern_addr_fault(struct pt_regs *regs, unsigned long hw_error_code,
 	}
 #endif
 
+	/* Try resolving the RMP fault. */
+	if (hw_error_code & X86_PF_RMP) {
+		if (handle_kern_rmp_fault(hw_error_code, address) == RMP_FAULT_RETRY)
+			return;
+	}
+
 	if (is_f00f_bug(regs, hw_error_code, address))
 		return;