diff mbox series

[v2,5/6] x86/alternatives: do not BUG during apply

Message ID 20240925084239.85649-6-roger.pau@citrix.com (mailing list archive)
State Superseded
Headers show
Series xen/livepatch: improvements to loading | expand

Commit Message

Roger Pau Monne Sept. 25, 2024, 8:42 a.m. UTC
alternatives is used both at boot time, and when loading livepatch payloads.
While for the former it makes sense to panic, it's not useful for the later, as
for livepatches it's possible to fail to load the livepatch if alternatives
cannot be resolved and continue operating normally.

Relax the BUGs in _apply_alternatives() to instead return an error code.  The
caller will figure out whether the failures are fatal and panic.

Print an error message to provide some user-readable information about what
went wrong.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Changes since v1:
 - Unconditionally return from _apply_alternative() and let the caller panic
   if required.
 - Remove test, as next patch imposes restrictions that break the test.
---
 xen/arch/x86/alternative.c             | 46 ++++++++++++++++++++------
 xen/arch/x86/include/asm/alternative.h |  2 +-
 xen/common/livepatch.c                 | 10 +++++-
 3 files changed, 46 insertions(+), 12 deletions(-)

Comments

Jan Beulich Sept. 25, 2024, 9:01 a.m. UTC | #1
On 25.09.2024 10:42, Roger Pau Monne wrote:
> alternatives is used both at boot time, and when loading livepatch payloads.
> While for the former it makes sense to panic, it's not useful for the later, as
> for livepatches it's possible to fail to load the livepatch if alternatives
> cannot be resolved and continue operating normally.
> 
> Relax the BUGs in _apply_alternatives() to instead return an error code.  The
> caller will figure out whether the failures are fatal and panic.
> 
> Print an error message to provide some user-readable information about what
> went wrong.
> 
> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>

Reviewed-by: Jan Beulich <jbeulich@suse.com>

Albeit ...

> @@ -394,8 +418,10 @@ static int __init cf_check nmi_apply_alternatives(
>                                   PAGE_HYPERVISOR_RWX);
>          flush_local(FLUSH_TLB_GLOBAL);
>  
> -        _apply_alternatives(__alt_instructions, __alt_instructions_end,
> -                            alt_done);
> +        rc = _apply_alternatives(__alt_instructions, __alt_instructions_end,
> +                                 alt_done);
> +        if ( rc )
> +            panic("Unable to apply alternatives: %d\n", rc);

... I see little value in logging rc here - the other log message will
provide better detail anyway.

> --- a/xen/common/livepatch.c
> +++ b/xen/common/livepatch.c
> @@ -896,7 +896,15 @@ static int prepare_payload(struct payload *payload,
>                  return -EINVAL;
>              }
>          }
> -        apply_alternatives(start, end);
> +
> +        rc = apply_alternatives(start, end);
> +        if ( rc )
> +        {
> +            printk(XENLOG_ERR LIVEPATCH "%s applying alternatives failed: %d\n",
> +                   elf->name, rc);
> +            return rc;
> +        }

Whereas here it may indeed make sense to keep things as you have them, as the
error code is propagated onwards, and matching it with other error messages
coming from elsewhere may help in understanding the situation.

As to possible applying: It looks as if this was independent of the earlier 4
patches?

Jan
Roger Pau Monne Sept. 25, 2024, 9:28 a.m. UTC | #2
On Wed, Sep 25, 2024 at 11:01:08AM +0200, Jan Beulich wrote:
> On 25.09.2024 10:42, Roger Pau Monne wrote:
> > alternatives is used both at boot time, and when loading livepatch payloads.
> > While for the former it makes sense to panic, it's not useful for the later, as
> > for livepatches it's possible to fail to load the livepatch if alternatives
> > cannot be resolved and continue operating normally.
> > 
> > Relax the BUGs in _apply_alternatives() to instead return an error code.  The
> > caller will figure out whether the failures are fatal and panic.
> > 
> > Print an error message to provide some user-readable information about what
> > went wrong.
> > 
> > Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> 
> Reviewed-by: Jan Beulich <jbeulich@suse.com>
> 
> Albeit ...
> 
> > @@ -394,8 +418,10 @@ static int __init cf_check nmi_apply_alternatives(
> >                                   PAGE_HYPERVISOR_RWX);
> >          flush_local(FLUSH_TLB_GLOBAL);
> >  
> > -        _apply_alternatives(__alt_instructions, __alt_instructions_end,
> > -                            alt_done);
> > +        rc = _apply_alternatives(__alt_instructions, __alt_instructions_end,
> > +                                 alt_done);
> > +        if ( rc )
> > +            panic("Unable to apply alternatives: %d\n", rc);
> 
> ... I see little value in logging rc here - the other log message will
> provide better detail anyway.

Current log messages do indeed provide better detail, but maybe we end
up adding new return error paths to _apply_alternatives() in the
future.  I see no harm in printing the error code if there's one.

> > --- a/xen/common/livepatch.c
> > +++ b/xen/common/livepatch.c
> > @@ -896,7 +896,15 @@ static int prepare_payload(struct payload *payload,
> >                  return -EINVAL;
> >              }
> >          }
> > -        apply_alternatives(start, end);
> > +
> > +        rc = apply_alternatives(start, end);
> > +        if ( rc )
> > +        {
> > +            printk(XENLOG_ERR LIVEPATCH "%s applying alternatives failed: %d\n",
> > +                   elf->name, rc);
> > +            return rc;
> > +        }
> 
> Whereas here it may indeed make sense to keep things as you have them, as the
> error code is propagated onwards, and matching it with other error messages
> coming from elsewhere may help in understanding the situation.
> 
> As to possible applying: It looks as if this was independent of the earlier 4
> patches?

Yes, I think patches 5 and 6 can be applied ahead of the preceding
livepatch fixes.

Thanks, Roger.
Jan Beulich Sept. 25, 2024, 10:02 a.m. UTC | #3
On 25.09.2024 11:28, Roger Pau Monné wrote:
> On Wed, Sep 25, 2024 at 11:01:08AM +0200, Jan Beulich wrote:
>> On 25.09.2024 10:42, Roger Pau Monne wrote:
>>> alternatives is used both at boot time, and when loading livepatch payloads.
>>> While for the former it makes sense to panic, it's not useful for the later, as
>>> for livepatches it's possible to fail to load the livepatch if alternatives
>>> cannot be resolved and continue operating normally.
>>>
>>> Relax the BUGs in _apply_alternatives() to instead return an error code.  The
>>> caller will figure out whether the failures are fatal and panic.
>>>
>>> Print an error message to provide some user-readable information about what
>>> went wrong.
>>>
>>> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
>>
>> Reviewed-by: Jan Beulich <jbeulich@suse.com>
>>
>> Albeit ...
>>
>>> @@ -394,8 +418,10 @@ static int __init cf_check nmi_apply_alternatives(
>>>                                   PAGE_HYPERVISOR_RWX);
>>>          flush_local(FLUSH_TLB_GLOBAL);
>>>  
>>> -        _apply_alternatives(__alt_instructions, __alt_instructions_end,
>>> -                            alt_done);
>>> +        rc = _apply_alternatives(__alt_instructions, __alt_instructions_end,
>>> +                                 alt_done);
>>> +        if ( rc )
>>> +            panic("Unable to apply alternatives: %d\n", rc);
>>
>> ... I see little value in logging rc here - the other log message will
>> provide better detail anyway.
> 
> Current log messages do indeed provide better detail, but maybe we end
> up adding new return error paths to _apply_alternatives() in the
> future.

I'd expect such error paths to then also have dedicated logging.

>  I see no harm in printing the error code if there's one.

Well, it's not much harm indeed, yet imo extra data logged also normally
wants to have a reason for the logging. After if you look at the log,
you'd expect every detail to tell you something (useful; in some certain
cases at least). Anyway - I don't mean to insist on the removal, it just
looked pretty useless to me.

Jan
Andrew Cooper Sept. 25, 2024, 10:25 a.m. UTC | #4
On 25/09/2024 11:02 am, Jan Beulich wrote:
> On 25.09.2024 11:28, Roger Pau Monné wrote:
>> On Wed, Sep 25, 2024 at 11:01:08AM +0200, Jan Beulich wrote:
>>> On 25.09.2024 10:42, Roger Pau Monne wrote:
>>>> alternatives is used both at boot time, and when loading livepatch payloads.
>>>> While for the former it makes sense to panic, it's not useful for the later, as
>>>> for livepatches it's possible to fail to load the livepatch if alternatives
>>>> cannot be resolved and continue operating normally.
>>>>
>>>> Relax the BUGs in _apply_alternatives() to instead return an error code.  The
>>>> caller will figure out whether the failures are fatal and panic.
>>>>
>>>> Print an error message to provide some user-readable information about what
>>>> went wrong.
>>>>
>>>> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
>>> Reviewed-by: Jan Beulich <jbeulich@suse.com>
>>>
>>> Albeit ...
>>>
>>>> @@ -394,8 +418,10 @@ static int __init cf_check nmi_apply_alternatives(
>>>>                                   PAGE_HYPERVISOR_RWX);
>>>>          flush_local(FLUSH_TLB_GLOBAL);
>>>>  
>>>> -        _apply_alternatives(__alt_instructions, __alt_instructions_end,
>>>> -                            alt_done);
>>>> +        rc = _apply_alternatives(__alt_instructions, __alt_instructions_end,
>>>> +                                 alt_done);
>>>> +        if ( rc )
>>>> +            panic("Unable to apply alternatives: %d\n", rc);
>>> ... I see little value in logging rc here - the other log message will
>>> provide better detail anyway.
>> Current log messages do indeed provide better detail, but maybe we end
>> up adding new return error paths to _apply_alternatives() in the
>> future.
> I'd expect such error paths to then also have dedicated logging.
>
>>  I see no harm in printing the error code if there's one.
> Well, it's not much harm indeed, yet imo extra data logged also normally
> wants to have a reason for the logging. After if you look at the log,
> you'd expect every detail to tell you something (useful; in some certain
> cases at least). Anyway - I don't mean to insist on the removal, it just
> looked pretty useless to me.

People reading the logs can ignore bits they don't think are useful.

What they cannot do is read data which should have been there but wasn't.

~Andrew
Andrew Cooper Sept. 25, 2024, 10:53 a.m. UTC | #5
On 25/09/2024 9:42 am, Roger Pau Monne wrote:
> alternatives is used both at boot time, and when loading livepatch payloads.
> While for the former it makes sense to panic, it's not useful for the later, as
> for livepatches it's possible to fail to load the livepatch if alternatives
> cannot be resolved and continue operating normally.
>
> Relax the BUGs in _apply_alternatives() to instead return an error code.  The
> caller will figure out whether the failures are fatal and panic.
>
> Print an error message to provide some user-readable information about what
> went wrong.
>
> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>

Much nicer.  A few more diagnostic comments.

> diff --git a/xen/arch/x86/alternative.c b/xen/arch/x86/alternative.c
> index 7824053c9d33..c8848ba6006e 100644
> --- a/xen/arch/x86/alternative.c
> +++ b/xen/arch/x86/alternative.c
> @@ -198,9 +198,29 @@ static void init_or_livepatch _apply_alternatives(struct alt_instr *start,
>          uint8_t buf[MAX_PATCH_LEN];
>          unsigned int total_len = a->orig_len + a->pad_len;
>  
> -        BUG_ON(a->repl_len > total_len);
> -        BUG_ON(total_len > sizeof(buf));
> -        BUG_ON(a->cpuid >= NCAPINTS * 32);
> +        if ( a->repl_len > total_len )
> +        {
> +            printk(XENLOG_ERR
> +                   "alt replacement size (%#x) bigger than destination (%#x)\n",

These all say "some alternative went wrong", without identifying which. 
For x86_decode_lite(), debugging was far easier when using:

"Alternative for %ps ...", ALT_ORIG_PTR(a)

If we get the order of loading correct, then %ps will even work for a
livepatch, but that's secondary - even the raw number is slightly useful
given the livepatch load address.

I could be talked down to just "Alt for %ps" as you've got it here.  I
think it's clear enough in context.  So, I'd recommend:

"Alt for %ps, replacement size %#x larger than origin %#x\n".

Here, I think origin is better than destination, when discussing
alternatives.

I can adjust on commit.  Everything else is fine.

~Andrew
Roger Pau Monne Sept. 25, 2024, 1:55 p.m. UTC | #6
On Wed, Sep 25, 2024 at 11:53:30AM +0100, Andrew Cooper wrote:
> On 25/09/2024 9:42 am, Roger Pau Monne wrote:
> > alternatives is used both at boot time, and when loading livepatch payloads.
> > While for the former it makes sense to panic, it's not useful for the later, as
> > for livepatches it's possible to fail to load the livepatch if alternatives
> > cannot be resolved and continue operating normally.
> >
> > Relax the BUGs in _apply_alternatives() to instead return an error code.  The
> > caller will figure out whether the failures are fatal and panic.
> >
> > Print an error message to provide some user-readable information about what
> > went wrong.
> >
> > Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> 
> Much nicer.  A few more diagnostic comments.
> 
> > diff --git a/xen/arch/x86/alternative.c b/xen/arch/x86/alternative.c
> > index 7824053c9d33..c8848ba6006e 100644
> > --- a/xen/arch/x86/alternative.c
> > +++ b/xen/arch/x86/alternative.c
> > @@ -198,9 +198,29 @@ static void init_or_livepatch _apply_alternatives(struct alt_instr *start,
> >          uint8_t buf[MAX_PATCH_LEN];
> >          unsigned int total_len = a->orig_len + a->pad_len;
> >  
> > -        BUG_ON(a->repl_len > total_len);
> > -        BUG_ON(total_len > sizeof(buf));
> > -        BUG_ON(a->cpuid >= NCAPINTS * 32);
> > +        if ( a->repl_len > total_len )
> > +        {
> > +            printk(XENLOG_ERR
> > +                   "alt replacement size (%#x) bigger than destination (%#x)\n",
> 
> These all say "some alternative went wrong", without identifying which. 
> For x86_decode_lite(), debugging was far easier when using:
> 
> "Alternative for %ps ...", ALT_ORIG_PTR(a)
> 
> If we get the order of loading correct, then %ps will even work for a
> livepatch, but that's secondary - even the raw number is slightly useful
> given the livepatch load address.

I don't think this will work as-is for livepatches.  The call to
register the virtual region is currently done in livepatch_upload(),
after load_payload_data() has completed.

We could see about registering the virtual region earlier (no
volunteering to do that work right now).

> I could be talked down to just "Alt for %ps" as you've got it here.  I
> think it's clear enough in context.  So, I'd recommend:
> 
> "Alt for %ps, replacement size %#x larger than origin %#x\n".
> 
> Here, I think origin is better than destination, when discussing
> alternatives.

Sure.

> I can adjust on commit.  Everything else is fine.

If you are comfortable with doing the adjustments on commit, please go
ahead.

Thanks, Roger.
diff mbox series

Patch

diff --git a/xen/arch/x86/alternative.c b/xen/arch/x86/alternative.c
index 7824053c9d33..c8848ba6006e 100644
--- a/xen/arch/x86/alternative.c
+++ b/xen/arch/x86/alternative.c
@@ -175,9 +175,9 @@  extern void *const __initdata_cf_clobber_end[];
  * invocation, such that no CALLs/JMPs to NULL pointers will be left
  * around. See also the further comment below.
  */
-static void init_or_livepatch _apply_alternatives(struct alt_instr *start,
-                                                  struct alt_instr *end,
-                                                  bool force)
+static int init_or_livepatch _apply_alternatives(struct alt_instr *start,
+                                                 struct alt_instr *end,
+                                                 bool force)
 {
     struct alt_instr *a, *base;
 
@@ -198,9 +198,29 @@  static void init_or_livepatch _apply_alternatives(struct alt_instr *start,
         uint8_t buf[MAX_PATCH_LEN];
         unsigned int total_len = a->orig_len + a->pad_len;
 
-        BUG_ON(a->repl_len > total_len);
-        BUG_ON(total_len > sizeof(buf));
-        BUG_ON(a->cpuid >= NCAPINTS * 32);
+        if ( a->repl_len > total_len )
+        {
+            printk(XENLOG_ERR
+                   "alt replacement size (%#x) bigger than destination (%#x)\n",
+                   a->repl_len, total_len);
+            return -ENOSPC;
+        }
+
+        if ( total_len > sizeof(buf) )
+        {
+            printk(XENLOG_ERR
+                   "alt destination size (%#x) bigger than buffer (%#zx)\n",
+                   total_len, sizeof(buf));
+            return -ENOSPC;
+        }
+
+        if ( a->cpuid >= NCAPINTS * 32 )
+        {
+             printk(XENLOG_ERR
+                   "alt CPU feature (%#x) outside of featureset range (%#x)\n",
+                   a->cpuid, NCAPINTS * 32);
+            return -ERANGE;
+        }
 
         /*
          * Detect sequences of alt_instr's patching the same origin site, and
@@ -356,12 +376,14 @@  static void init_or_livepatch _apply_alternatives(struct alt_instr *start,
 
         printk("altcall: Optimised away %u endbr64 instructions\n", clobbered);
     }
+
+    return 0;
 }
 
 #ifdef CONFIG_LIVEPATCH
-void apply_alternatives(struct alt_instr *start, struct alt_instr *end)
+int apply_alternatives(struct alt_instr *start, struct alt_instr *end)
 {
-    _apply_alternatives(start, end, true);
+    return _apply_alternatives(start, end, true);
 }
 #endif
 
@@ -383,6 +405,8 @@  static int __init cf_check nmi_apply_alternatives(
      */
     if ( !(alt_done & alt_todo) )
     {
+        int rc;
+
         /*
          * Relax perms on .text to be RWX, so we can modify them.
          *
@@ -394,8 +418,10 @@  static int __init cf_check nmi_apply_alternatives(
                                  PAGE_HYPERVISOR_RWX);
         flush_local(FLUSH_TLB_GLOBAL);
 
-        _apply_alternatives(__alt_instructions, __alt_instructions_end,
-                            alt_done);
+        rc = _apply_alternatives(__alt_instructions, __alt_instructions_end,
+                                 alt_done);
+        if ( rc )
+            panic("Unable to apply alternatives: %d\n", rc);
 
         /*
          * Reinstate perms on .text to be RX.  This also cleans out the dirty
diff --git a/xen/arch/x86/include/asm/alternative.h b/xen/arch/x86/include/asm/alternative.h
index a86eadfaecbd..69555d781ef9 100644
--- a/xen/arch/x86/include/asm/alternative.h
+++ b/xen/arch/x86/include/asm/alternative.h
@@ -24,7 +24,7 @@  struct __packed alt_instr {
 
 extern void add_nops(void *insns, unsigned int len);
 /* Similar to alternative_instructions except it can be run with IRQs enabled. */
-extern void apply_alternatives(struct alt_instr *start, struct alt_instr *end);
+extern int apply_alternatives(struct alt_instr *start, struct alt_instr *end);
 extern void alternative_instructions(void);
 extern void alternative_branches(void);
 
diff --git a/xen/common/livepatch.c b/xen/common/livepatch.c
index 895c425cd5ea..c777f64d88d4 100644
--- a/xen/common/livepatch.c
+++ b/xen/common/livepatch.c
@@ -896,7 +896,15 @@  static int prepare_payload(struct payload *payload,
                 return -EINVAL;
             }
         }
-        apply_alternatives(start, end);
+
+        rc = apply_alternatives(start, end);
+        if ( rc )
+        {
+            printk(XENLOG_ERR LIVEPATCH "%s applying alternatives failed: %d\n",
+                   elf->name, rc);
+            return rc;
+        }
+
     alt_done:;
 #else
         printk(XENLOG_ERR LIVEPATCH "%s: We don't support alternative patching\n",