diff mbox series

[2/2] x86/pv: Handle #PF correctly when reading the IO permission bitmap

Message ID 20240930161837.1248144-3-andrew.cooper3@citrix.com (mailing list archive)
State New
Headers show
Series x86/pv: Fixes to guest_io_okay() | expand

Commit Message

Andrew Cooper Sept. 30, 2024, 4:18 p.m. UTC
The switch statement in guest_io_okay() is a very expensive way of
pre-initialising x with ~0, and performing a partial read into it.

However, the logic isn't correct either.

In a real TSS, the CPU always reads two bytes (like here), and any TSS limit
violation turns silently into no-access.  But, in-limit accesses trigger #PF
as usual.  AMD document this property explicitly, and while Intel don't (so
far as I can tell), they do behave consistently with AMD.

Switch from __copy_from_guest_offset() to __copy_from_guest_pv(), like
everything else in this file.  This removes code generation setting up
copy_from_user_hvm() (in the likely path even), and safety LFENCEs from
evaluate_nospec().

Change the logic to raise #PF if __copy_from_guest_pv() fails, rather than
disallowing the IO port access.  This brings the behaviour better in line with
normal x86.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>

RFC: Should we make the boundary check be (port + bytes + 8)?  That would be
more correct, but liable to break unsuspecting VMs.  Maybe we should just
comment our way out of it.

This needs to be combined with Jan's "x86/PV: simplify (and thus correct)
guest accessor functions" to function completely correctly.  From XTF testing:

This series on its own:

  --- Xen Test Framework ---
  Environment: PV 64bit (Long mode 4 levels)
  Test pv-emul-cr2
  Error: %cr2 expected 0x3000, got 0x2fff
  Test result: ERROR

This series plus Jan's fix:

  --- Xen Test Framework ---
  Environment: PV 64bit (Long mode 4 levels)
  Test pv-emul-cr2
  Test result: SUCCESS

Interestingly, the test also does an `INW` without an output parameter
straddling that page boundary, and it does reliably get 0x3000 even without
the accessor fix.
---
 xen/arch/x86/pv/emul-priv-op.c | 27 ++++++++++++---------------
 1 file changed, 12 insertions(+), 15 deletions(-)

Comments

Jan Beulich Oct. 1, 2024, 7:36 a.m. UTC | #1
On 30.09.2024 18:18, Andrew Cooper wrote:
> RFC: Should we make the boundary check be (port + bytes + 8)?  That would be
> more correct, but liable to break unsuspecting VMs.  Maybe we should just
> comment our way out of it.

What would the "+ 8" be intended to express? (I take it you mean ...

> --- a/xen/arch/x86/pv/emul-priv-op.c
> +++ b/xen/arch/x86/pv/emul-priv-op.c
> @@ -169,29 +169,26 @@ static intguest_io_okay(unsigned int port, unsigned int bytes,
>  
>      if ( (port + bytes) <= v->arch.pv.iobmp_limit )

... this check, which looks correct to me as is. In particular with the
"+ 8" there would appear to be no way to access ports at the very top of
the 64k range anymore, as PHYSDEVOP_set_iobitmap handling caps nr_ports
at 64k. IOW I think "commenting our way out of it" is the only possible
approach.)

With or without such a comment added
Reviewed-by: Jan Beulich <jbeulich@suse.com>

Jan
Andrew Cooper Oct. 1, 2024, 10:41 a.m. UTC | #2
On 01/10/2024 8:36 am, Jan Beulich wrote:
> On 30.09.2024 18:18, Andrew Cooper wrote:
>> RFC: Should we make the boundary check be (port + bytes + 8)?  That would be
>> more correct, but liable to break unsuspecting VMs.  Maybe we should just
>> comment our way out of it.
> What would the "+ 8" be intended to express? (I take it you mean ...
>
>> --- a/xen/arch/x86/pv/emul-priv-op.c
>> +++ b/xen/arch/x86/pv/emul-priv-op.c
>> @@ -169,29 +169,26 @@ static intguest_io_okay(unsigned int port, unsigned int bytes,
>>  
>>      if ( (port + bytes) <= v->arch.pv.iobmp_limit )
> ... this check, which looks correct to me as is. In particular with the
> "+ 8" there would appear to be no way to access ports at the very top of
> the 64k range anymore, as PHYSDEVOP_set_iobitmap handling caps nr_ports
> at 64k. IOW I think "commenting our way out of it" is the only possible
> approach.)

This is actually a horrid corner case, different to what I was thinking
yesterday.

In a real TSS, the CPU doesn't read past the TR limit.  Such accesses
are turned into a permission failure.

But in Xen, guest_io_okay() does read one byte past iobmp_limit.

Previously, we'd consume what was there if it was accessible, or declare
a permissions failure if it was inaccessible.  Now, we'll throw #PF back
to the kernel with %cr2 beyond the the limit.


It's not OK for Xen to be reading past the given limit; it's not how
real CPUs behave.  Part of the problem is that, even since it's
introduction in 013351bd7 (2006), the public interface was named
nr_ports while Xen's internal field was called iobmp_limit.

In terms of how it's used these days in PV guests:

Linux points the bitmap into it's real TSS, and sets nr_ports to either
0 or 65536 depending on whether the userspace task is permitted to use
IO ports or not.

NetBSD doesn't seem to use PHYSDEVOP_set_iobitmap at all, and only uses
PHYSDEVOP_set_iopl.

MiniOS uses neither.


I think Xen's internal variable wants renaming to not be a limit, and
the comment for PHYSDEVOP_set_iobitmap wants extending to note that,
like a real TSS, Xen might read one byte beyond the end of the bitmap.

I think this wants doing in a separate patch.  Thoughts?

> With or without such a comment added
> Reviewed-by: Jan Beulich <jbeulich@suse.com>

Thanks.

~Andrew
Jan Beulich Oct. 1, 2024, 10:56 a.m. UTC | #3
On 01.10.2024 12:41, Andrew Cooper wrote:
> I think Xen's internal variable wants renaming to not be a limit, and
> the comment for PHYSDEVOP_set_iobitmap wants extending to note that,
> like a real TSS, Xen might read one byte beyond the end of the bitmap.
> 
> I think this wants doing in a separate patch.  Thoughts?

That's probably going to be best.

Jan
diff mbox series

Patch

diff --git a/xen/arch/x86/pv/emul-priv-op.c b/xen/arch/x86/pv/emul-priv-op.c
index 978bd6c0775f..b5d184038fa3 100644
--- a/xen/arch/x86/pv/emul-priv-op.c
+++ b/xen/arch/x86/pv/emul-priv-op.c
@@ -169,29 +169,26 @@  static int guest_io_okay(unsigned int port, unsigned int bytes,
 
     if ( (port + bytes) <= v->arch.pv.iobmp_limit )
     {
-        union { uint8_t bytes[2]; uint16_t mask; } x;
+        const void *__user addr = v->arch.pv.iobmp.p + (port >> 3);
+        uint16_t mask;
+        int rc;
 
-        /*
-         * Grab permission bytes from guest space. Inaccessible bytes are
-         * read as 0xff (no access allowed).
-         */
+        /* Grab permission bytes from guest space. */
         if ( user_mode )
             toggle_guest_pt(v);
 
-        switch ( __copy_from_guest_offset(x.bytes, v->arch.pv.iobmp,
-                                          port>>3, 2) )
-        {
-        default: x.bytes[0] = ~0;
-            /* fallthrough */
-        case 1:  x.bytes[1] = ~0;
-            /* fallthrough */
-        case 0:  break;
-        }
+        rc = __copy_from_guest_pv(&mask, addr, 2);
 
         if ( user_mode )
             toggle_guest_pt(v);
 
-        if ( (x.mask & (((1 << bytes) - 1) << (port & 7))) == 0 )
+        if ( rc )
+        {
+            x86_emul_pagefault(0, (unsigned long)addr + bytes - rc, ctxt);
+            return X86EMUL_EXCEPTION;
+        }
+
+        if ( (mask & (((1 << bytes) - 1) << (port & 7))) == 0 )
             return X86EMUL_OKAY;
     }