diff mbox series

[v3] fs/proc: Expose RSEQ configuration

Message ID 20210126185412.175204-1-figiel@google.com (mailing list archive)
State New
Headers show
Series [v3] fs/proc: Expose RSEQ configuration | expand

Commit Message

Piotr Figiel Jan. 26, 2021, 6:54 p.m. UTC
For userspace checkpoint and restore (C/R) some way of getting process
state containing RSEQ configuration is needed.

There are two ways this information is going to be used:
 - to re-enable RSEQ for threads which had it enabled before C/R
 - to detect if a thread was in a critical section during C/R

Since C/R preserves TLS memory and addresses RSEQ ABI will be restored
using the address registered before C/R.

Detection whether the thread is in a critical section during C/R is
needed to enforce behavior of RSEQ abort during C/R. Attaching with
ptrace() before registers are dumped itself doesn't cause RSEQ abort.
Restoring the instruction pointer within the critical section is
problematic because rseq_cs may get cleared before the control is
passed to the migrated application code leading to RSEQ invariants not
being preserved.

To achieve above goals expose the RSEQ structure address and the
signature value with the new per-thread procfs file "rseq".

Signed-off-by: Piotr Figiel <figiel@google.com>

---

v3:
 - added locking so that the proc file always shows consistent pair of
   RSEQ ABI address and the signature
 - changed string formatting to use %px for the RSEQ ABI address
v2:
 - fixed string formatting for 32-bit architectures
v1:
 - https://lkml.kernel.org/r/20210113174127.2500051-1-figiel@google.com

---
 fs/exec.c      |  2 ++
 fs/proc/base.c | 22 ++++++++++++++++++++++
 kernel/rseq.c  |  4 ++++
 3 files changed, 28 insertions(+)

Comments

Andrew Morton Jan. 26, 2021, 7:25 p.m. UTC | #1
On Tue, 26 Jan 2021 19:54:12 +0100 Piotr Figiel <figiel@google.com> wrote:

> For userspace checkpoint and restore (C/R) some way of getting process
> state containing RSEQ configuration is needed.
> 
> There are two ways this information is going to be used:
>  - to re-enable RSEQ for threads which had it enabled before C/R
>  - to detect if a thread was in a critical section during C/R
> 
> Since C/R preserves TLS memory and addresses RSEQ ABI will be restored
> using the address registered before C/R.
> 
> Detection whether the thread is in a critical section during C/R is
> needed to enforce behavior of RSEQ abort during C/R. Attaching with
> ptrace() before registers are dumped itself doesn't cause RSEQ abort.
> Restoring the instruction pointer within the critical section is
> problematic because rseq_cs may get cleared before the control is
> passed to the migrated application code leading to RSEQ invariants not
> being preserved.
> 
> To achieve above goals expose the RSEQ structure address and the
> signature value with the new per-thread procfs file "rseq".

Using "/proc/<pid>/rseq" would be more informative.

>  fs/exec.c      |  2 ++
>  fs/proc/base.c | 22 ++++++++++++++++++++++
>  kernel/rseq.c  |  4 ++++

A Documentation/ update would be appropriate.

>  3 files changed, 28 insertions(+)
> 
> diff --git a/fs/exec.c b/fs/exec.c
> index 5d4d52039105..5d84f98847f1 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -1830,7 +1830,9 @@ static int bprm_execve(struct linux_binprm *bprm,
>  	/* execve succeeded */
>  	current->fs->in_exec = 0;
>  	current->in_execve = 0;
> +	task_lock(current);
>  	rseq_execve(current);
> +	task_unlock(current);

There's a comment over the task_lock() implementation which explains
what things it locks.  An update to that would be helpful.

> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -662,6 +662,22 @@ static int proc_pid_syscall(struct seq_file *m, struct pid_namespace *ns,
>  
>  	return 0;
>  }
> +
> +#ifdef CONFIG_RSEQ
> +static int proc_pid_rseq(struct seq_file *m, struct pid_namespace *ns,
> +				struct pid *pid, struct task_struct *task)
> +{
> +	int res = lock_trace(task);
> +
> +	if (res)
> +		return res;
> +	task_lock(task);
> +	seq_printf(m, "%px %08x\n", task->rseq, task->rseq_sig);
> +	task_unlock(task);
> +	unlock_trace(task);
> +	return 0;
> +}

Do we actually need task_lock() for this purpose?  Would
exec_update_lock() alone be adequate and appropriate?
Mathieu Desnoyers Jan. 26, 2021, 8:58 p.m. UTC | #2
----- On Jan 26, 2021, at 1:54 PM, Piotr Figiel figiel@google.com wrote:
[...]
> diff --git a/kernel/rseq.c b/kernel/rseq.c
> index a4f86a9d6937..6aea67878065 100644
> --- a/kernel/rseq.c
> +++ b/kernel/rseq.c
> @@ -322,8 +322,10 @@ SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32,
> rseq_len,
> 		ret = rseq_reset_rseq_cpu_id(current);
> 		if (ret)
> 			return ret;
> +		task_lock(current);
> 		current->rseq = NULL;
> 		current->rseq_sig = 0;
> +		task_unlock(current);
> 		return 0;
> 	}
> 
> @@ -353,8 +355,10 @@ SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32,
> rseq_len,
> 		return -EINVAL;
> 	if (!access_ok(rseq, rseq_len))
> 		return -EFAULT;
> +	task_lock(current);
> 	current->rseq = rseq;
> 	current->rseq_sig = sig;
> +	task_unlock(current);

So AFAIU, the locks are there to make sure that whenever a user-space thread reads
that state through that new /proc file ABI, it observes coherent "rseq" vs "rseq_sig"
values. However, I'm not convinced this is the right approach to consistency here.

Because if you add locking as done here, you ensure that the /proc file reader
sees coherent values, but between the point where those values are read from
kernel-space, copied to user-space, and then acted upon by user-space, those can
very well have become outdated if the observed process runs concurrently.

So my understanding here is that the only non-racy way to effectively use those
values is to either read them from /proc/self/* (from the thread owning the task struct),
or to ensure that the thread is stopped/frozen while the read is done.

Maybe we should consider validating that the proc file is used from the right context
(from self or when the target thread is stopped/frozen) rather than add dubious locking ?

Thanks,

Mathieu
Piotr Figiel Jan. 27, 2021, 3:07 p.m. UTC | #3
On Tue, Jan 26, 2021 at 11:25:47AM -0800, Andrew Morton wrote:
> On Tue, 26 Jan 2021 19:54:12 +0100 Piotr Figiel <figiel@google.com> wrote:
> > To achieve above goals expose the RSEQ structure address and the
> > signature value with the new per-thread procfs file "rseq".
> Using "/proc/<pid>/rseq" would be more informative.
> 
> >  fs/exec.c      |  2 ++
> >  fs/proc/base.c | 22 ++++++++++++++++++++++
> >  kernel/rseq.c  |  4 ++++
> 
> A Documentation/ update would be appropriate.
> 
> > +	task_lock(current);
> >  	rseq_execve(current);
> > +	task_unlock(current);
> 
> There's a comment over the task_lock() implementation which explains
> what things it locks.  An update to that would be helpful.

Agreed I'll include fixes for above comments in v4.

> > --- a/fs/proc/base.c
> > +++ b/fs/proc/base.c
> > @@ -662,6 +662,22 @@ static int proc_pid_syscall(struct seq_file *m, struct pid_namespace *ns,
> >  
> >  	return 0;
> >  }
> > +
> > +#ifdef CONFIG_RSEQ
> > +static int proc_pid_rseq(struct seq_file *m, struct pid_namespace *ns,
> > +				struct pid *pid, struct task_struct *task)
> > +{
> > +	int res = lock_trace(task);
> > +
> > +	if (res)
> > +		return res;
> > +	task_lock(task);
> > +	seq_printf(m, "%px %08x\n", task->rseq, task->rseq_sig);
> > +	task_unlock(task);
> > +	unlock_trace(task);
> > +	return 0;
> > +}
> 
> Do we actually need task_lock() for this purpose?  Would
> exec_update_lock() alone be adequate and appropriate?

Now rseq syscall which modifies those fields isn't synchronised with
exec_update_lock. So either a new lock or task_lock() could be used or
exec_update_lock could be reused in the syscall.  I decided against
exec_update_lock reuse in the syscall because it's normally used to
guard access checks against concurrent setuid exec. This could be
potentially confusing as it's not relevant for the the rseq syscall
code.
I think task_lock usage here is also consistent with how it's used
across the kernel.

Whether we need consistent rseq and rseq_sig pairs in the proc output, I
think there's some argument for it (discussed also in parallel thread
with Mathieu Desnoyers).

Best regards,
Piotr.
Piotr Figiel Jan. 27, 2021, 3:14 p.m. UTC | #4
On Tue, Jan 26, 2021 at 03:58:46PM -0500, Mathieu Desnoyers wrote:
> ----- On Jan 26, 2021, at 1:54 PM, Piotr Figiel figiel@google.com wrote:
> [...]
> > diff --git a/kernel/rseq.c b/kernel/rseq.c
> > index a4f86a9d6937..6aea67878065 100644
> > --- a/kernel/rseq.c
> > +++ b/kernel/rseq.c
> > @@ -322,8 +322,10 @@ SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32,
> > rseq_len,
> > 		ret = rseq_reset_rseq_cpu_id(current);
> > 		if (ret)
> > 			return ret;
> > +		task_lock(current);
> > 		current->rseq = NULL;
> > 		current->rseq_sig = 0;
> > +		task_unlock(current);
> > 		return 0;
> > 	}
> > 
> > @@ -353,8 +355,10 @@ SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32,
> > rseq_len,
> > 		return -EINVAL;
> > 	if (!access_ok(rseq, rseq_len))
> > 		return -EFAULT;
> > +	task_lock(current);
> > 	current->rseq = rseq;
> > 	current->rseq_sig = sig;
> > +	task_unlock(current);
> 
> So AFAIU, the locks are there to make sure that whenever a user-space
> thread reads that state through that new /proc file ABI, it observes
> coherent "rseq" vs "rseq_sig" values.

Yes, that was the intention.

> However, I'm not convinced this is the right approach to consistency
> here.
> 
> Because if you add locking as done here, you ensure that the /proc
> file reader sees coherent values, but between the point where those
> values are read from kernel-space, copied to user-space, and then
> acted upon by user-space, those can very well have become outdated if
> the observed process runs concurrently.

You are right here, but I think this comment is valid for most of the
process information exported via procfs. The user can almost always make
a time of check/time of use race when interacting with procfs. I agree
that the locking added in v3 doesn't help much, but at least it does
provide a well defined answer: i.e. at least in some point of time the
effective configuration was as returned.
It makes it a bit easier to document and reason about the file contents,
compared to the inconsistent version.

> So my understanding here is that the only non-racy way to effectively
> use those values is to either read them from /proc/self/* (from the
> thread owning the task struct), or to ensure that the thread is
> stopped/frozen while the read is done.

Constraining this solely to the owning thread I think is a bit too
limiting. I think we could limit it to stopped threads but I don't think
it eliminates the potential of time of check/time of use races for the
user.  In this shape as in v3 - it's up to the user to decide if there
is a relevant risk of a race, if it's unwanted then the thread can be
stopped with e.g. ptrace, cgroup freeze or SIGSTOP.

Best regards,
Piotr.
diff mbox series

Patch

diff --git a/fs/exec.c b/fs/exec.c
index 5d4d52039105..5d84f98847f1 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1830,7 +1830,9 @@  static int bprm_execve(struct linux_binprm *bprm,
 	/* execve succeeded */
 	current->fs->in_exec = 0;
 	current->in_execve = 0;
+	task_lock(current);
 	rseq_execve(current);
+	task_unlock(current);
 	acct_update_integrals(current);
 	task_numa_free(current, false);
 	return retval;
diff --git a/fs/proc/base.c b/fs/proc/base.c
index b3422cda2a91..89232329d966 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -662,6 +662,22 @@  static int proc_pid_syscall(struct seq_file *m, struct pid_namespace *ns,
 
 	return 0;
 }
+
+#ifdef CONFIG_RSEQ
+static int proc_pid_rseq(struct seq_file *m, struct pid_namespace *ns,
+				struct pid *pid, struct task_struct *task)
+{
+	int res = lock_trace(task);
+
+	if (res)
+		return res;
+	task_lock(task);
+	seq_printf(m, "%px %08x\n", task->rseq, task->rseq_sig);
+	task_unlock(task);
+	unlock_trace(task);
+	return 0;
+}
+#endif /* CONFIG_RSEQ */
 #endif /* CONFIG_HAVE_ARCH_TRACEHOOK */
 
 /************************************************************************/
@@ -3182,6 +3198,9 @@  static const struct pid_entry tgid_base_stuff[] = {
 	REG("comm",      S_IRUGO|S_IWUSR, proc_pid_set_comm_operations),
 #ifdef CONFIG_HAVE_ARCH_TRACEHOOK
 	ONE("syscall",    S_IRUSR, proc_pid_syscall),
+#ifdef CONFIG_RSEQ
+	ONE("rseq",       S_IRUSR, proc_pid_rseq),
+#endif
 #endif
 	REG("cmdline",    S_IRUGO, proc_pid_cmdline_ops),
 	ONE("stat",       S_IRUGO, proc_tgid_stat),
@@ -3522,6 +3541,9 @@  static const struct pid_entry tid_base_stuff[] = {
 			 &proc_pid_set_comm_operations, {}),
 #ifdef CONFIG_HAVE_ARCH_TRACEHOOK
 	ONE("syscall",   S_IRUSR, proc_pid_syscall),
+#ifdef CONFIG_RSEQ
+	ONE("rseq",      S_IRUSR, proc_pid_rseq),
+#endif
 #endif
 	REG("cmdline",   S_IRUGO, proc_pid_cmdline_ops),
 	ONE("stat",      S_IRUGO, proc_tid_stat),
diff --git a/kernel/rseq.c b/kernel/rseq.c
index a4f86a9d6937..6aea67878065 100644
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -322,8 +322,10 @@  SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32, rseq_len,
 		ret = rseq_reset_rseq_cpu_id(current);
 		if (ret)
 			return ret;
+		task_lock(current);
 		current->rseq = NULL;
 		current->rseq_sig = 0;
+		task_unlock(current);
 		return 0;
 	}
 
@@ -353,8 +355,10 @@  SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32, rseq_len,
 		return -EINVAL;
 	if (!access_ok(rseq, rseq_len))
 		return -EFAULT;
+	task_lock(current);
 	current->rseq = rseq;
 	current->rseq_sig = sig;
+	task_unlock(current);
 	/*
 	 * If rseq was previously inactive, and has just been
 	 * registered, ensure the cpu_id_start and cpu_id fields