diff mbox series

x86/pv: Inject #GP for implicit grant unmaps

Message ID 20220725175013.893-1-andrew.cooper3@citrix.com (mailing list archive)
State New, archived
Headers show
Series x86/pv: Inject #GP for implicit grant unmaps | expand

Commit Message

Andrew Cooper July 25, 2022, 5:50 p.m. UTC
This is a debug behaviour to identify buggy kernels.  Crashing the domain is
the most unhelpful thing to do, because it discards the relevant context.

Instead, inject #GP[0] like other permission errors in x86.  In particular,
this lets the kernel provide a backtrace that's actually helpful to a
developer trying to figure out what's going wrong.

As a bugfix, this always injects #GP[0] to current, not l1e_owner.  It is not
l1e_owner's fault if dom0 using superpowers triggers an implicit unmap.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>
CC: Wei Liu <wl@xen.org>
CC: Juergen Gross <jgross@suse.com>

This is a prerequisite to investigating
https://github.com/QubesOS/qubes-issues/issues/7631 which is looking like an
error in Linux's gntdev driver.
---
 xen/arch/x86/mm.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Jan Beulich July 26, 2022, 6:29 a.m. UTC | #1
On 25.07.2022 19:50, Andrew Cooper wrote:
> This is a debug behaviour to identify buggy kernels.  Crashing the domain is
> the most unhelpful thing to do, because it discards the relevant context.
> 
> Instead, inject #GP[0] like other permission errors in x86.  In particular,
> this lets the kernel provide a backtrace that's actually helpful to a
> developer trying to figure out what's going wrong.
> 
> As a bugfix, this always injects #GP[0] to current, not l1e_owner.  It is not
> l1e_owner's fault if dom0 using superpowers triggers an implicit unmap.
> 
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>

Acked-by: Jan Beulich <jbeulich@suse.com>

Albeit preferably with ...

> --- a/xen/arch/x86/mm.c
> +++ b/xen/arch/x86/mm.c
> @@ -1232,7 +1232,7 @@ void put_page_from_l1e(l1_pgentry_t l1e, struct domain *l1e_owner)
>          gdprintk(XENLOG_WARNING,
>                   "Attempt to implicitly unmap a granted PTE %" PRIpte "\n",
>                   l1e_get_intpte(l1e));
> -        domain_crash(l1e_owner);
> +        pv_inject_hw_exception(TRAP_gp_fault, 0);
>      }
>  #endif

... the gdprintk() adjusted to also log l1e_owner.

Jan
Andrew Cooper July 26, 2022, 11:51 a.m. UTC | #2
On 26/07/2022 07:29, Jan Beulich wrote:
> On 25.07.2022 19:50, Andrew Cooper wrote:
>> This is a debug behaviour to identify buggy kernels.  Crashing the domain is
>> the most unhelpful thing to do, because it discards the relevant context.
>>
>> Instead, inject #GP[0] like other permission errors in x86.  In particular,
>> this lets the kernel provide a backtrace that's actually helpful to a
>> developer trying to figure out what's going wrong.
>>
>> As a bugfix, this always injects #GP[0] to current, not l1e_owner.  It is not
>> l1e_owner's fault if dom0 using superpowers triggers an implicit unmap.
>>
>> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
> Acked-by: Jan Beulich <jbeulich@suse.com>
>
> Albeit preferably with ...
>
>> --- a/xen/arch/x86/mm.c
>> +++ b/xen/arch/x86/mm.c
>> @@ -1232,7 +1232,7 @@ void put_page_from_l1e(l1_pgentry_t l1e, struct domain *l1e_owner)
>>          gdprintk(XENLOG_WARNING,
>>                   "Attempt to implicitly unmap a granted PTE %" PRIpte "\n",
>>                   l1e_get_intpte(l1e));
>> -        domain_crash(l1e_owner);
>> +        pv_inject_hw_exception(TRAP_gp_fault, 0);
>>      }
>>  #endif
> ... the gdprintk() adjusted to also log l1e_owner.

Ok, how about this incremental?

diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index b3393385ffb6..74054fb5f4ee 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -1229,9 +1229,9 @@ void put_page_from_l1e(l1_pgentry_t l1e, struct
domain *l1e_owner)
     if ( (l1e_get_flags(l1e) & _PAGE_GNTTAB) &&
          !l1e_owner->is_shutting_down && !l1e_owner->is_dying )
     {
-        gdprintk(XENLOG_WARNING,
-                 "Attempt to implicitly unmap a granted PTE %" PRIpte "\n",
-                 l1e_get_intpte(l1e));
+        gprintk(XENLOG_WARNING,
+                "Attempt to implicitly %pd's gntmap PTE %" PRIpte "\n",
+                l1e_owner, l1e_get_intpte(l1e));
         pv_inject_hw_exception(TRAP_gp_fault, 0);
     }
 #endif

The printk() needs to not be omitted in release builds which happen to
have this logic compiled in.

~Andrew
Jan Beulich July 26, 2022, 12:04 p.m. UTC | #3
On 26.07.2022 13:51, Andrew Cooper wrote:
> On 26/07/2022 07:29, Jan Beulich wrote:
>> On 25.07.2022 19:50, Andrew Cooper wrote:
>>> This is a debug behaviour to identify buggy kernels.  Crashing the domain is
>>> the most unhelpful thing to do, because it discards the relevant context.
>>>
>>> Instead, inject #GP[0] like other permission errors in x86.  In particular,
>>> this lets the kernel provide a backtrace that's actually helpful to a
>>> developer trying to figure out what's going wrong.
>>>
>>> As a bugfix, this always injects #GP[0] to current, not l1e_owner.  It is not
>>> l1e_owner's fault if dom0 using superpowers triggers an implicit unmap.
>>>
>>> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
>> Acked-by: Jan Beulich <jbeulich@suse.com>
>>
>> Albeit preferably with ...
>>
>>> --- a/xen/arch/x86/mm.c
>>> +++ b/xen/arch/x86/mm.c
>>> @@ -1232,7 +1232,7 @@ void put_page_from_l1e(l1_pgentry_t l1e, struct domain *l1e_owner)
>>>          gdprintk(XENLOG_WARNING,
>>>                   "Attempt to implicitly unmap a granted PTE %" PRIpte "\n",
>>>                   l1e_get_intpte(l1e));
>>> -        domain_crash(l1e_owner);
>>> +        pv_inject_hw_exception(TRAP_gp_fault, 0);
>>>      }
>>>  #endif
>> ... the gdprintk() adjusted to also log l1e_owner.
> 
> Ok, how about this incremental?
> 
> diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
> index b3393385ffb6..74054fb5f4ee 100644
> --- a/xen/arch/x86/mm.c
> +++ b/xen/arch/x86/mm.c
> @@ -1229,9 +1229,9 @@ void put_page_from_l1e(l1_pgentry_t l1e, struct
> domain *l1e_owner)
>      if ( (l1e_get_flags(l1e) & _PAGE_GNTTAB) &&
>           !l1e_owner->is_shutting_down && !l1e_owner->is_dying )
>      {
> -        gdprintk(XENLOG_WARNING,
> -                 "Attempt to implicitly unmap a granted PTE %" PRIpte "\n",
> -                 l1e_get_intpte(l1e));
> +        gprintk(XENLOG_WARNING,
> +                "Attempt to implicitly %pd's gntmap PTE %" PRIpte "\n",
> +                l1e_owner, l1e_get_intpte(l1e));

DYM to drop "unmap"? With it restored (or anything similar to that
effect), fine with me.

Jan
diff mbox series

Patch

diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index 5b81d5fbdbb2..b3393385ffb6 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -1232,7 +1232,7 @@  void put_page_from_l1e(l1_pgentry_t l1e, struct domain *l1e_owner)
         gdprintk(XENLOG_WARNING,
                  "Attempt to implicitly unmap a granted PTE %" PRIpte "\n",
                  l1e_get_intpte(l1e));
-        domain_crash(l1e_owner);
+        pv_inject_hw_exception(TRAP_gp_fault, 0);
     }
 #endif