diff mbox series

[RESEND,3/3] rseq: Ensure SIGBUS delivered on memory failure

Message ID 20240723144752.1478226-3-andrew.zaborowski@intel.com (mailing list archive)
State New
Headers show
Series [RESEND,1/3] x86: Add task_struct flag to force SIGBUS on MCE | expand

Commit Message

Andrew Zaborowski July 23, 2024, 2:47 p.m. UTC
Uncorrected memory errors for user pages are signaled to processes
using SIGBUS or, if the error happens in a syscall, an error retval
from the syscall.  The SIGBUS is documented in
Documentation/mm/hwpoison.rst#failure-recovery-modes

Once a user task sets t->rseq in the rseq() syscall, if the kernel
cannot access the memory pointed to by t->rseq->rseq_cs, that initial
rseq() and all future syscalls should return an error so understandably
the code just kills the task.

To ensure that SIGBUS is used set the new t->kill_on_efault flag and
run queued task work on rseq_get_rseq_cs() errors to give memory_failure
the chance to run.

Note: the rseq checks run inside resume_user_mode_work() so whenever
_TIF_NOTIFY_RESUME is set.  They do not run on every syscall exit so
I'm not concerned that these extra flag operations are in a hot path,
except with CONFIG_DEBUG_RSEQ.

Signed-off-by: Andrew Zaborowski <andrew.zaborowski@intel.com>
---
 kernel/rseq.c | 25 +++++++++++++++++++++----
 1 file changed, 21 insertions(+), 4 deletions(-)

Comments

Kees Cook Aug. 6, 2024, 4:37 a.m. UTC | #1
On Tue, Jul 23, 2024 at 04:47:52PM +0200, Andrew Zaborowski wrote:
> Uncorrected memory errors for user pages are signaled to processes
> using SIGBUS or, if the error happens in a syscall, an error retval
> from the syscall.  The SIGBUS is documented in
> Documentation/mm/hwpoison.rst#failure-recovery-modes
> 
> Once a user task sets t->rseq in the rseq() syscall, if the kernel
> cannot access the memory pointed to by t->rseq->rseq_cs, that initial
> rseq() and all future syscalls should return an error so understandably
> the code just kills the task.
> 
> To ensure that SIGBUS is used set the new t->kill_on_efault flag and
> run queued task work on rseq_get_rseq_cs() errors to give memory_failure
> the chance to run.
> 
> Note: the rseq checks run inside resume_user_mode_work() so whenever
> _TIF_NOTIFY_RESUME is set.  They do not run on every syscall exit so
> I'm not concerned that these extra flag operations are in a hot path,
> except with CONFIG_DEBUG_RSEQ.
> 
> Signed-off-by: Andrew Zaborowski <andrew.zaborowski@intel.com>
> ---
>  kernel/rseq.c | 25 +++++++++++++++++++++----

Can an rseq maintainer please review this? I can carry it via the execve
tree with the related patches...

-Kees

>  1 file changed, 21 insertions(+), 4 deletions(-)
> 
> diff --git a/kernel/rseq.c b/kernel/rseq.c
> index 9de6e35fe..c5809cd13 100644
> --- a/kernel/rseq.c
> +++ b/kernel/rseq.c
> @@ -13,6 +13,7 @@
>  #include <linux/syscalls.h>
>  #include <linux/rseq.h>
>  #include <linux/types.h>
> +#include <linux/task_work.h>
>  #include <asm/ptrace.h>
>  
>  #define CREATE_TRACE_POINTS
> @@ -320,6 +321,8 @@ void __rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs)
>  	if (unlikely(t->flags & PF_EXITING))
>  		return;
>  
> +	t->kill_on_efault = true;
> +
>  	/*
>  	 * regs is NULL if and only if the caller is in a syscall path.  Skip
>  	 * fixup and leave rseq_cs as is so that rseq_sycall() will detect and
> @@ -330,13 +333,18 @@ void __rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs)
>  		if (unlikely(ret < 0))
>  			goto error;
>  	}
> -	if (unlikely(rseq_update_cpu_node_id(t)))
> -		goto error;
> -	return;
> +	if (likely(!rseq_update_cpu_node_id(t)))
> +		goto out;
>  
>  error:
> +	/* Allow task work to override signr */
> +	task_work_run();
> +
>  	sig = ksig ? ksig->sig : 0;
>  	force_sigsegv(sig);
> +
> +out:
> +	t->kill_on_efault = false;
>  }
>  
>  #ifdef CONFIG_DEBUG_RSEQ
> @@ -353,8 +361,17 @@ void rseq_syscall(struct pt_regs *regs)
>  
>  	if (!t->rseq)
>  		return;
> -	if (rseq_get_rseq_cs(t, &rseq_cs) || in_rseq_cs(ip, &rseq_cs))
> +
> +	t->kill_on_efault = true;
> +
> +	if (rseq_get_rseq_cs(t, &rseq_cs) || in_rseq_cs(ip, &rseq_cs)) {
> +		/* Allow task work to override signr */
> +		task_work_run();
> +
>  		force_sig(SIGSEGV);
> +	}
> +
> +	t->kill_on_efault = false;
>  }
>  
>  #endif
> -- 
> 2.43.0
>
Peter Zijlstra Aug. 6, 2024, 7:51 a.m. UTC | #2
On Mon, Aug 05, 2024 at 09:37:48PM -0700, Kees Cook wrote:
> On Tue, Jul 23, 2024 at 04:47:52PM +0200, Andrew Zaborowski wrote:
> > Uncorrected memory errors for user pages are signaled to processes
> > using SIGBUS or, if the error happens in a syscall, an error retval
> > from the syscall.  The SIGBUS is documented in
> > Documentation/mm/hwpoison.rst#failure-recovery-modes
> > 
> > Once a user task sets t->rseq in the rseq() syscall, if the kernel
> > cannot access the memory pointed to by t->rseq->rseq_cs, that initial
> > rseq() and all future syscalls should return an error so understandably
> > the code just kills the task.
> > 
> > To ensure that SIGBUS is used set the new t->kill_on_efault flag and
> > run queued task work on rseq_get_rseq_cs() errors to give memory_failure
> > the chance to run.
> > 
> > Note: the rseq checks run inside resume_user_mode_work() so whenever
> > _TIF_NOTIFY_RESUME is set.  They do not run on every syscall exit so
> > I'm not concerned that these extra flag operations are in a hot path,
> > except with CONFIG_DEBUG_RSEQ.
> > 
> > Signed-off-by: Andrew Zaborowski <andrew.zaborowski@intel.com>
> > ---
> >  kernel/rseq.c | 25 +++++++++++++++++++++----
> 
> Can an rseq maintainer please review this? I can carry it via the execve
> tree with the related patches...

*sigh*,.. because get_maintainers just doesn't work or something?

Anyway, I'm confused by the signal code (as always), why isn't the
task_work_run() in get_signal() sufficient?

At some point we're going to run into trouble with sprinkling
task_work_run() around willy nilly :/

> 
> >  1 file changed, 21 insertions(+), 4 deletions(-)
> > 
> > diff --git a/kernel/rseq.c b/kernel/rseq.c
> > index 9de6e35fe..c5809cd13 100644
> > --- a/kernel/rseq.c
> > +++ b/kernel/rseq.c
> > @@ -13,6 +13,7 @@
> >  #include <linux/syscalls.h>
> >  #include <linux/rseq.h>
> >  #include <linux/types.h>
> > +#include <linux/task_work.h>
> >  #include <asm/ptrace.h>
> >  
> >  #define CREATE_TRACE_POINTS
> > @@ -320,6 +321,8 @@ void __rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs)
> >  	if (unlikely(t->flags & PF_EXITING))
> >  		return;
> >  
> > +	t->kill_on_efault = true;
> > +
> >  	/*
> >  	 * regs is NULL if and only if the caller is in a syscall path.  Skip
> >  	 * fixup and leave rseq_cs as is so that rseq_sycall() will detect and
> > @@ -330,13 +333,18 @@ void __rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs)
> >  		if (unlikely(ret < 0))
> >  			goto error;
> >  	}
> > -	if (unlikely(rseq_update_cpu_node_id(t)))
> > -		goto error;
> > -	return;
> > +	if (likely(!rseq_update_cpu_node_id(t)))
> > +		goto out;
> >  
> >  error:
> > +	/* Allow task work to override signr */
> > +	task_work_run();
> > +
> >  	sig = ksig ? ksig->sig : 0;
> >  	force_sigsegv(sig);
> > +
> > +out:
> > +	t->kill_on_efault = false;
> >  }
> >  
> >  #ifdef CONFIG_DEBUG_RSEQ
> > @@ -353,8 +361,17 @@ void rseq_syscall(struct pt_regs *regs)
> >  
> >  	if (!t->rseq)
> >  		return;
> > -	if (rseq_get_rseq_cs(t, &rseq_cs) || in_rseq_cs(ip, &rseq_cs))
> > +
> > +	t->kill_on_efault = true;
> > +
> > +	if (rseq_get_rseq_cs(t, &rseq_cs) || in_rseq_cs(ip, &rseq_cs)) {
> > +		/* Allow task work to override signr */
> > +		task_work_run();
> > +
> >  		force_sig(SIGSEGV);
> > +	}
> > +
> > +	t->kill_on_efault = false;
> >  }
> >  
> >  #endif
> > -- 
> > 2.43.0
> > 
> 
> -- 
> Kees Cook
Mathieu Desnoyers Aug. 6, 2024, 2:19 p.m. UTC | #3
On 2024-08-06 03:51, Peter Zijlstra wrote:
> On Mon, Aug 05, 2024 at 09:37:48PM -0700, Kees Cook wrote:
>> On Tue, Jul 23, 2024 at 04:47:52PM +0200, Andrew Zaborowski wrote:
>>> Uncorrected memory errors for user pages are signaled to processes
>>> using SIGBUS or, if the error happens in a syscall, an error retval
>>> from the syscall.  The SIGBUS is documented in
>>> Documentation/mm/hwpoison.rst#failure-recovery-modes
>>>
>>> Once a user task sets t->rseq in the rseq() syscall, if the kernel
>>> cannot access the memory pointed to by t->rseq->rseq_cs, that initial
>>> rseq() and all future syscalls should return an error so understandably
>>> the code just kills the task.
>>>
>>> To ensure that SIGBUS is used set the new t->kill_on_efault flag and
>>> run queued task work on rseq_get_rseq_cs() errors to give memory_failure
>>> the chance to run.
>>>
>>> Note: the rseq checks run inside resume_user_mode_work() so whenever
>>> _TIF_NOTIFY_RESUME is set.  They do not run on every syscall exit so
>>> I'm not concerned that these extra flag operations are in a hot path,
>>> except with CONFIG_DEBUG_RSEQ.
>>>
>>> Signed-off-by: Andrew Zaborowski <andrew.zaborowski@intel.com>
>>> ---
>>>   kernel/rseq.c | 25 +++++++++++++++++++++----
>>
>> Can an rseq maintainer please review this? I can carry it via the execve
>> tree with the related patches...
> 
> *sigh*,.. because get_maintainers just doesn't work or something?
> 
> Anyway, I'm confused by the signal code (as always), why isn't the
> task_work_run() in get_signal() sufficient?
> 
> At some point we're going to run into trouble with sprinkling
> task_work_run() around willy nilly :/

I agree with Peter: adding explicit calls to task_work_run all over
the kernel does not appear to be an elegant solution.

One thing I am missing is a clear motivation for adding this code:
what is the real-world use-case that benefits from getting this SIGBUS
rather than a SIGSEGV or -EFAULT ?

Also, I feel like we should investigate turning SIGSEGV into SIGBUS
at signal delivery rather than handle this here and there in the kernel
code.

Thoughts ?

Thanks,

Mathieu
diff mbox series

Patch

diff --git a/kernel/rseq.c b/kernel/rseq.c
index 9de6e35fe..c5809cd13 100644
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -13,6 +13,7 @@ 
 #include <linux/syscalls.h>
 #include <linux/rseq.h>
 #include <linux/types.h>
+#include <linux/task_work.h>
 #include <asm/ptrace.h>
 
 #define CREATE_TRACE_POINTS
@@ -320,6 +321,8 @@  void __rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs)
 	if (unlikely(t->flags & PF_EXITING))
 		return;
 
+	t->kill_on_efault = true;
+
 	/*
 	 * regs is NULL if and only if the caller is in a syscall path.  Skip
 	 * fixup and leave rseq_cs as is so that rseq_sycall() will detect and
@@ -330,13 +333,18 @@  void __rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs)
 		if (unlikely(ret < 0))
 			goto error;
 	}
-	if (unlikely(rseq_update_cpu_node_id(t)))
-		goto error;
-	return;
+	if (likely(!rseq_update_cpu_node_id(t)))
+		goto out;
 
 error:
+	/* Allow task work to override signr */
+	task_work_run();
+
 	sig = ksig ? ksig->sig : 0;
 	force_sigsegv(sig);
+
+out:
+	t->kill_on_efault = false;
 }
 
 #ifdef CONFIG_DEBUG_RSEQ
@@ -353,8 +361,17 @@  void rseq_syscall(struct pt_regs *regs)
 
 	if (!t->rseq)
 		return;
-	if (rseq_get_rseq_cs(t, &rseq_cs) || in_rseq_cs(ip, &rseq_cs))
+
+	t->kill_on_efault = true;
+
+	if (rseq_get_rseq_cs(t, &rseq_cs) || in_rseq_cs(ip, &rseq_cs)) {
+		/* Allow task work to override signr */
+		task_work_run();
+
 		force_sig(SIGSEGV);
+	}
+
+	t->kill_on_efault = false;
 }
 
 #endif