[v2] ucounts: Fix signal ucount refcounting

Message ID 87fssytizw.fsf_-_@disp2133 (mailing list archive)
State New
Series [v2] ucounts: Fix signal ucount refcounting

Commit Message

Eric W. Biederman Oct. 18, 2021, 4:06 p.m. UTC
In commit fda31c50292a ("signal: avoid double atomic counter
increments for user accounting") Linus made a clever optimization to
how rlimits and struct user_struct are handled.  Unfortunately that
optimization does not work in the obvious way when moved to nested
rlimits.  The problem is that the last decrement of the
per-user-namespace per-user sigpending counter might also be the last
decrement of the sigpending counter in the parent user namespace,
which means that simply freeing the leaf ucount in __sigqueue_free is
not enough.
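
To make the failure mode concrete, here is a toy model of nested
per-namespace counters (illustration only: every name in it is
invented for this sketch, none of it is the kernel's code) showing why
the final decrement has to walk the whole namespace chain instead of
stopping at the leaf:

    /* toy_nested_counts.c - compile with: cc -std=c11 toy_nested_counts.c */
    #include <stdatomic.h>
    #include <stdio.h>

    struct toy_ns;

    struct toy_ucounts {
            struct toy_ns *ns;      /* namespace owning this counter */
            atomic_long count;      /* pending-signal count at this level */
    };

    struct toy_ns {
            struct toy_ucounts *ucounts;    /* counter one level up */
    };

    /*
     * An increment in a child namespace is mirrored in every ancestor,
     * so the final decrement must make the same walk; decrementing only
     * the leaf would leave the ancestors' counts (and whatever
     * references pin them) unbalanced.
     */
    static void toy_dec_chain(struct toy_ucounts *uc)
    {
            for (; uc; uc = uc->ns ? uc->ns->ucounts : NULL)
                    atomic_fetch_sub(&uc->count, 1);
    }

    int main(void)
    {
            struct toy_ucounts root = { .ns = NULL };
            struct toy_ns child_ns = { .ucounts = &root };
            struct toy_ucounts leaf = { .ns = &child_ns };

            /* Queue one signal in the child namespace ... */
            atomic_fetch_add(&leaf.count, 1);
            atomic_fetch_add(&root.count, 1);

            /* ... and release it: both levels must return to 0. */
            toy_dec_chain(&leaf);
            printf("root=%ld leaf=%ld\n",
                   atomic_load(&root.count), atomic_load(&leaf.count));
            return 0;
    }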

Maintain the optimization and handle the tricky cases by introducing
inc_rlimit_get_ucounts and dec_rlimit_put_ucounts.

By moving the entire optimization into functions that perform all of
the work it becomes possible to ensure that every level is handled
properly.

The new function inc_rlimit_get_ucounts returns 0 on failure to
increment the ucount.  This is different from inc_rlimit_ucounts,
which increments the ucounts and returns LONG_MAX if the ucount
counter has exceeded its maximum or wrapped (to indicate that the
counter needs to be decremented).
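
At the call site the difference looks like this (a condensed sketch of
the __sigqueue_alloc() hunks from the patch below; the old code's
extra handling of the first increment, which also had to take a
ucounts reference, is omitted here for brevity):

    /* Old convention: LONG_MAX signals over-limit/wrap, and the
     * caller still has to undo the increment itself. */
    sigpending = inc_rlimit_ucounts(ucounts, UCOUNT_RLIMIT_SIGPENDING, 1);
    if (sigpending == LONG_MAX) {
            dec_rlimit_ucounts(ucounts, UCOUNT_RLIMIT_SIGPENDING, 1);
            return NULL;
    }

    /* New convention: 0 means the increment failed and every level
     * has already been unwound; nothing is left to clean up. */
    sigpending = inc_rlimit_get_ucounts(ucounts, UCOUNT_RLIMIT_SIGPENDING);
    if (!sigpending)
            return NULL;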

I wish we had a single user to account all pending signals to across
all of the threads of a process, so that this complexity was not
necessary.

Cc: stable@vger.kernel.org
Fixes: d64696905554 ("Reimplement RLIMIT_SIGPENDING on top of ucounts")
v1: https://lkml.kernel.org/r/87mtnavszx.fsf_-_@disp2133
Tested-by: Rune Kleveland <rune.kleveland@infomedia.dk>
Tested-by: Yu Zhao <yuzhao@google.com>
Tested-by: Jordan Glover <Golden_Miller83@protonmail.ch>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---

This is my version of the fix with all of the feedback rolled in.
I have tested it and believe this is ready to send out.

If folks could take a once-over and see if I have missed anything.

For the people who are testing or have tested this, I have added your
Tested-by's; please let me know if you mind.

Eric


 include/linux/user_namespace.h |  2 ++
 kernel/signal.c                | 25 +++++------------
 kernel/ucount.c                | 49 ++++++++++++++++++++++++++++++++++
 3 files changed, 57 insertions(+), 19 deletions(-)

Comments

Yu Zhao Oct. 18, 2021, 10:26 p.m. UTC | #1
On Mon, Oct 18, 2021 at 10:06 AM Eric W. Biederman
<ebiederm@xmission.com> wrote:
>
>
> In commit fda31c50292a ("signal: avoid double atomic counter
> increments for user accounting") Linus made a clever optimization to
> how rlimits and struct user_struct are handled.  Unfortunately that
> optimization does not work in the obvious way when moved to nested
> rlimits.  The problem is that the last decrement of the
> per-user-namespace per-user sigpending counter might also be the last
> decrement of the sigpending counter in the parent user namespace,
> which means that simply freeing the leaf ucount in __sigqueue_free is
> not enough.
>
> Maintain the optimization and handle the tricky cases by introducing
> inc_rlimit_get_ucounts and dec_rlimit_put_ucounts.
>
> By moving the entire optimization into functions that perform all of
> the work it becomes possible to ensure that every level is handled
> properly.
>
> The new function inc_rlimit_get_ucounts returns 0 on failure to
> increment the ucount.  This is different from inc_rlimit_ucounts,
> which increments the ucounts and returns LONG_MAX if the ucount
> counter has exceeded its maximum or wrapped (to indicate that the
> counter needs to be decremented).
>
> I wish we had a single user to account all pending signals to across
> all of the threads of a process, so that this complexity was not
> necessary.
>
> Cc: stable@vger.kernel.org
> Fixes: d64696905554 ("Reimplement RLIMIT_SIGPENDING on top of ucounts")
> v1: https://lkml.kernel.org/r/87mtnavszx.fsf_-_@disp2133
> Tested-by: Rune Kleveland <rune.kleveland@infomedia.dk>
> Tested-by: Yu Zhao <yuzhao@google.com>
> Tested-by: Jordan Glover <Golden_Miller83@protonmail.ch>
> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
> ---
>
> This is my version of the fix with all of the feedback rolled in.
> I have tested it and believe this is ready to send out.
>
> If folks could take a once-over and see if I have missed anything.
>
> For the people who are testing or have tested this, I have added your
> Tested-by's; please let me know if you mind.
>
> Eric

Retested on the latest 5.15-rc6. This patch fixes the following crash:

[ 3307.621443] ==================================================================
[ 3307.628890] BUG: KASAN: use-after-free in dec_ucount+0x50/0xd8
[ 3307.634903] Write of size 8 at addr ffffff80a5e4c520 by task kworker/7:3/201
[ 3307.642149]
[ 3307.643695] CPU: 7 PID: 201 Comm: kworker/7:3 Not tainted 5.15.0-rc6-lockdep+ #7
[ 3307.651291] Hardware name: Google Lazor (rev3+) with KB Backlight (DT)
[ 3307.658001] Workqueue: events free_user_ns
[ 3307.662231] Call trace:
[ 3307.664750]  dump_backtrace+0x0/0x42c
[ 3307.668522]  show_stack+0x24/0x30
[ 3307.671935]  dump_stack_lvl+0xd0/0x100
[ 3307.675797]  print_address_description+0x30/0x304
[ 3307.680646]  kasan_report+0x190/0x1d8
[ 3307.684419]  kasan_check_range+0x1ac/0x1bc
[ 3307.688630]  __kasan_check_write+0x44/0x54
[ 3307.692852]  dec_ucount+0x50/0xd8
[ 3307.696266]  free_user_ns+0x1b0/0x288
[ 3307.700036]  process_one_work+0x7b4/0x1130
[ 3307.704251]  worker_thread+0x800/0xcf4
[ 3307.708111]  kthread+0x2a8/0x358
[ 3307.711437]  ret_from_fork+0x10/0x20
[ 3307.715121]
[ 3307.716664] Allocated by task 6564:
[ 3307.720253]  kasan_save_stack+0x38/0x68
[ 3307.724206]  __kasan_kmalloc+0x9c/0xb8
[ 3307.728068]  kmem_cache_alloc_trace+0x260/0x32c
[ 3307.732729]  alloc_ucounts+0x150/0x374
[ 3307.736589]  set_cred_ucounts+0x178/0x240
[ 3307.740714]  __sys_setresuid+0x31c/0x4f8
[ 3307.744754]  __arm64_sys_setresuid+0x84/0x98
[ 3307.749153]  invoke_syscall+0xcc/0x2bc
[ 3307.753012]  el0_svc_common+0x118/0x1ec
[ 3307.756961]  do_el0_svc_compat+0x50/0x60
[ 3307.761005]  el0_svc_compat+0x5c/0xec
[ 3307.764774]  el0t_32_sync_handler+0xc0/0xf0
[ 3307.769083]  el0t_32_sync+0x1a4/0x1a8
[ 3307.772852]
[ 3307.774399] The buggy address belongs to the object at ffffff80a5e4c500
[ 3307.774399]  which belongs to the cache kmalloc-256 of size 256
[ 3307.787250] The buggy address is located 32 bytes inside of
[ 3307.787250]  256-byte region [ffffff80a5e4c500, ffffff80a5e4c600)
[ 3307.799216] The buggy address belongs to the page:
[ 3307.804141] page:fffffffe02979200 refcount:1 mapcount:0 mapping:0000000000000000 index:0xffffff80a5e4c100 pfn:0x125e48
[ 3307.815127] head:fffffffe02979200 order:3 compound_mapcount:0 compound_pincount:0
[ 3307.822808] flags: 0x8000000000010200(slab|head|zone=2)
[ 3307.828187] raw: 8000000000010200 fffffffe0250ba08 fffffffe00f04a08 ffffff808000c980
[ 3307.836148] raw: ffffff80a5e4c100 0000000000200001 00000001ffffffff 0000000000000000
[ 3307.844104] page dumped because: kasan: bad access detected
[ 3307.849837]
[ 3307.851381] Memory state around the buggy address:
[ 3307.856307]  ffffff80a5e4c400: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 3307.863729]  ffffff80a5e4c480: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 3307.871146] >ffffff80a5e4c500: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 3307.878562]                                ^
[ 3307.882956]  ffffff80a5e4c580: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 3307.890375]  ffffff80a5e4c600: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 3307.897800] ==================================================================

> [snip: remainder of the quoted patch; the full patch appears below]

Patch

diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
index eb70cabe6e7f..33a4240e6a6f 100644
--- a/include/linux/user_namespace.h
+++ b/include/linux/user_namespace.h
@@ -127,6 +127,8 @@ static inline long get_ucounts_value(struct ucounts *ucounts, enum ucount_type t
 
 long inc_rlimit_ucounts(struct ucounts *ucounts, enum ucount_type type, long v);
 bool dec_rlimit_ucounts(struct ucounts *ucounts, enum ucount_type type, long v);
+long inc_rlimit_get_ucounts(struct ucounts *ucounts, enum ucount_type type);
+void dec_rlimit_put_ucounts(struct ucounts *ucounts, enum ucount_type type);
 bool is_ucounts_overlimit(struct ucounts *ucounts, enum ucount_type type, unsigned long max);
 
 static inline void set_rlimit_ucount_max(struct user_namespace *ns,
diff --git a/kernel/signal.c b/kernel/signal.c
index a3229add4455..13d2505a14a0 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -425,22 +425,10 @@ __sigqueue_alloc(int sig, struct task_struct *t, gfp_t gfp_flags,
 	 */
 	rcu_read_lock();
 	ucounts = task_ucounts(t);
-	sigpending = inc_rlimit_ucounts(ucounts, UCOUNT_RLIMIT_SIGPENDING, 1);
-	switch (sigpending) {
-	case 1:
-		if (likely(get_ucounts(ucounts)))
-			break;
-		fallthrough;
-	case LONG_MAX:
-		/*
-		 * we need to decrease the ucount in the userns tree on any
-		 * failure to avoid counts leaking.
-		 */
-		dec_rlimit_ucounts(ucounts, UCOUNT_RLIMIT_SIGPENDING, 1);
-		rcu_read_unlock();
-		return NULL;
-	}
+	sigpending = inc_rlimit_get_ucounts(ucounts, UCOUNT_RLIMIT_SIGPENDING);
 	rcu_read_unlock();
+	if (!sigpending)
+		return NULL;
 
 	if (override_rlimit || likely(sigpending <= task_rlimit(t, RLIMIT_SIGPENDING))) {
 		q = kmem_cache_alloc(sigqueue_cachep, gfp_flags);
@@ -449,8 +437,7 @@ __sigqueue_alloc(int sig, struct task_struct *t, gfp_t gfp_flags,
 	}
 
 	if (unlikely(q == NULL)) {
-		if (dec_rlimit_ucounts(ucounts, UCOUNT_RLIMIT_SIGPENDING, 1))
-			put_ucounts(ucounts);
+		dec_rlimit_put_ucounts(ucounts, UCOUNT_RLIMIT_SIGPENDING);
 	} else {
 		INIT_LIST_HEAD(&q->list);
 		q->flags = sigqueue_flags;
@@ -463,8 +450,8 @@ static void __sigqueue_free(struct sigqueue *q)
 {
 	if (q->flags & SIGQUEUE_PREALLOC)
 		return;
-	if (q->ucounts && dec_rlimit_ucounts(q->ucounts, UCOUNT_RLIMIT_SIGPENDING, 1)) {
-		put_ucounts(q->ucounts);
+	if (q->ucounts) {
+		dec_rlimit_put_ucounts(q->ucounts, UCOUNT_RLIMIT_SIGPENDING);
 		q->ucounts = NULL;
 	}
 	kmem_cache_free(sigqueue_cachep, q);
diff --git a/kernel/ucount.c b/kernel/ucount.c
index bb51849e6375..eb03f3c68375 100644
--- a/kernel/ucount.c
+++ b/kernel/ucount.c
@@ -284,6 +284,55 @@ bool dec_rlimit_ucounts(struct ucounts *ucounts, enum ucount_type type, long v)
 	return (new == 0);
 }
 
+static void do_dec_rlimit_put_ucounts(struct ucounts *ucounts,
+				struct ucounts *last, enum ucount_type type)
+{
+	struct ucounts *iter, *next;
+	for (iter = ucounts; iter != last; iter = next) {
+		long dec = atomic_long_add_return(-1, &iter->ucount[type]);
+		WARN_ON_ONCE(dec < 0);
+		next = iter->ns->ucounts;
+		if (dec == 0)
+			put_ucounts(iter);
+	}
+}
+
+void dec_rlimit_put_ucounts(struct ucounts *ucounts, enum ucount_type type)
+{
+	do_dec_rlimit_put_ucounts(ucounts, NULL, type);
+}
+
+long inc_rlimit_get_ucounts(struct ucounts *ucounts, enum ucount_type type)
+{
+	/* Caller must hold a reference to ucounts */
+	struct ucounts *iter;
+	long dec, ret = 0;
+
+	for (iter = ucounts; iter; iter = iter->ns->ucounts) {
+		long max = READ_ONCE(iter->ns->ucount_max[type]);
+		long new = atomic_long_add_return(1, &iter->ucount[type]);
+		if (new < 0 || new > max)
+			goto unwind;
+		if (iter == ucounts)
+			ret = new;
+		/*
+		 * Grab an extra ucount reference for the caller when
+		 * the rlimit count was previously 0.
+		 */
+		if (new != 1)
+			continue;
+		if (!get_ucounts(iter))
+			goto dec_unwind;
+	}
+	return ret;
+dec_unwind:
+	dec = atomic_long_add_return(-1, &iter->ucount[type]);
+	WARN_ON_ONCE(dec < 0);
+unwind:
+	do_dec_rlimit_put_ucounts(ucounts, iter, type);
+	return 0;
+}
+
 bool is_ucounts_overlimit(struct ucounts *ucounts, enum ucount_type type, unsigned long max)
 {
 	struct ucounts *iter;