diff mbox series

selinux,xfrm: fix dangling refcount on deferred skb free

Message ID 20241106155509.1706965-1-omosnace@redhat.com (mailing list archive)
State New
Delegated to: Netdev Maintainers
Headers show
Series selinux,xfrm: fix dangling refcount on deferred skb free | expand

Checks

Context Check Description
netdev/tree_selection success Not a local patch

Commit Message

Ondrej Mosnacek Nov. 6, 2024, 3:55 p.m. UTC
SELinux tracks the number of allocated xfrm_state/xfrm_policy objects
(via the selinux_xfrm_refcount variable) as an input in deciding if peer
labeling should be used.

However, as a result of commits f35f821935d8 ("tcp: defer skb freeing
after socket lock is released") and 68822bdf76f1 ("net: generalize skb
freeing deferral to per-cpu lists"), freeing of a sk_buff object, which
may hold a reference to an xfrm_state object, can be deferred for
processing on another CPU core, so even after xfrm_state is deleted from
the configuration by userspace, the refcount isn't decremented until the
deferred freeing of relevant sk_buffs happens. On a system with many
cores this can take a very long time (even minutes or more if the system
is not very active), leading to peer labeling being enabled for much
longer than expected.

Fix this by moving the selinux_xfrm_refcount decrementing to just after
the actual deletion of the xfrm objects rather than waiting for the
freeing to happen. For xfrm_policy it currently doesn't seem to be
necessary, but let's do the same there for consistency and
future-proofing.

We hit this issue on a specific aarch64 256-core system, where the
sequence of unix_socket/test and inet_socket/tcp/test from
selinux-testsuite [1] would quite reliably trigger this scenario, and a
subsequent sctp/test run would then stumble because the policy for that
test misses some rules that would make it work under peer labeling
enabled (namely it was getting the netif::egress permission denied in
some of the test cases).

[1] https://github.com/SELinuxProject/selinux-testsuite/

Fixes: f35f821935d8 ("tcp: defer skb freeing after socket lock is released")
Fixes: 68822bdf76f1 ("net: generalize skb freeing deferral to per-cpu lists")
Signed-off-by: Ondrej Mosnacek <omosnace@redhat.com>
---
 include/linux/lsm_hook_defs.h   |  2 ++
 include/linux/security.h        | 10 ++++++++++
 net/xfrm/xfrm_policy.c          |  1 +
 net/xfrm/xfrm_state.c           |  2 ++
 security/security.c             | 26 ++++++++++++++++++++++++++
 security/selinux/hooks.c        |  2 ++
 security/selinux/include/xfrm.h |  2 ++
 security/selinux/xfrm.c         | 17 ++++++++++++++++-
 8 files changed, 61 insertions(+), 1 deletion(-)

Comments

Eric Dumazet Nov. 6, 2024, 4:13 p.m. UTC | #1
On Wed, Nov 6, 2024 at 4:55 PM Ondrej Mosnacek <omosnace@redhat.com> wrote:
>
> SELinux tracks the number of allocated xfrm_state/xfrm_policy objects
> (via the selinux_xfrm_refcount variable) as an input in deciding if peer
> labeling should be used.
>
> However, as a result of commits f35f821935d8 ("tcp: defer skb freeing
> after socket lock is released") and 68822bdf76f1 ("net: generalize skb
> freeing deferral to per-cpu lists"), freeing of a sk_buff object, which
> may hold a reference to an xfrm_state object, can be deferred for
> processing on another CPU core, so even after xfrm_state is deleted from
> the configuration by userspace, the refcount isn't decremented until the
> deferred freeing of relevant sk_buffs happens. On a system with many
> cores this can take a very long time (even minutes or more if the system
> is not very active), leading to peer labeling being enabled for much
> longer than expected.
>
> Fix this by moving the selinux_xfrm_refcount decrementing to just after
> the actual deletion of the xfrm objects rather than waiting for the
> freeing to happen. For xfrm_policy it currently doesn't seem to be
> necessary, but let's do the same there for consistency and
> future-proofing.
>
> We hit this issue on a specific aarch64 256-core system, where the
> sequence of unix_socket/test and inet_socket/tcp/test from
> selinux-testsuite [1] would quite reliably trigger this scenario, and a
> subsequent sctp/test run would then stumble because the policy for that
> test misses some rules that would make it work under peer labeling
> enabled (namely it was getting the netif::egress permission denied in
> some of the test cases).
>
> [1] https://github.com/SELinuxProject/selinux-testsuite/
>
> Fixes: f35f821935d8 ("tcp: defer skb freeing after socket lock is released")
> Fixes: 68822bdf76f1 ("net: generalize skb freeing deferral to per-cpu lists")
> Signed-off-by: Ondrej Mosnacek <omosnace@redhat.com>
> ---

Can we explain why TCP packets sitting in TCP receive queues would
need to keep xfrm_state around ?

With thousands of TCP sockets. I would imagine that a similar issue
would be hit,
regardless of f35f821935d8 ("tcp: defer skb freeing after socket lock
is released") and 68822bdf76f1 ("net: generalize skb freeing deferral
to per-cpu lists")

We remove the dst from these incoming packets (see skb_dst_drop() in
tcp_data_queue() and tcp_add_backlog()),
I do not see how XFRM state could be kept ?
Ondrej Mosnacek Nov. 6, 2024, 4:54 p.m. UTC | #2
On Wed, Nov 6, 2024 at 5:13 PM Eric Dumazet <edumazet@google.com> wrote:
>
> On Wed, Nov 6, 2024 at 4:55 PM Ondrej Mosnacek <omosnace@redhat.com> wrote:
> >
> > SELinux tracks the number of allocated xfrm_state/xfrm_policy objects
> > (via the selinux_xfrm_refcount variable) as an input in deciding if peer
> > labeling should be used.
> >
> > However, as a result of commits f35f821935d8 ("tcp: defer skb freeing
> > after socket lock is released") and 68822bdf76f1 ("net: generalize skb
> > freeing deferral to per-cpu lists"), freeing of a sk_buff object, which
> > may hold a reference to an xfrm_state object, can be deferred for
> > processing on another CPU core, so even after xfrm_state is deleted from
> > the configuration by userspace, the refcount isn't decremented until the
> > deferred freeing of relevant sk_buffs happens. On a system with many
> > cores this can take a very long time (even minutes or more if the system
> > is not very active), leading to peer labeling being enabled for much
> > longer than expected.
> >
> > Fix this by moving the selinux_xfrm_refcount decrementing to just after
> > the actual deletion of the xfrm objects rather than waiting for the
> > freeing to happen. For xfrm_policy it currently doesn't seem to be
> > necessary, but let's do the same there for consistency and
> > future-proofing.
> >
> > We hit this issue on a specific aarch64 256-core system, where the
> > sequence of unix_socket/test and inet_socket/tcp/test from
> > selinux-testsuite [1] would quite reliably trigger this scenario, and a
> > subsequent sctp/test run would then stumble because the policy for that
> > test misses some rules that would make it work under peer labeling
> > enabled (namely it was getting the netif::egress permission denied in
> > some of the test cases).
> >
> > [1] https://github.com/SELinuxProject/selinux-testsuite/
> >
> > Fixes: f35f821935d8 ("tcp: defer skb freeing after socket lock is released")
> > Fixes: 68822bdf76f1 ("net: generalize skb freeing deferral to per-cpu lists")
> > Signed-off-by: Ondrej Mosnacek <omosnace@redhat.com>
> > ---
>
> Can we explain why TCP packets sitting in TCP receive queues would
> need to keep xfrm_state around ?
>
> With thousands of TCP sockets. I would imagine that a similar issue
> would be hit,
> regardless of f35f821935d8 ("tcp: defer skb freeing after socket lock
> is released") and 68822bdf76f1 ("net: generalize skb freeing deferral
> to per-cpu lists")
>
> We remove the dst from these incoming packets (see skb_dst_drop() in
> tcp_data_queue() and tcp_add_backlog()),
> I do not see how XFRM state could be kept ?

The problem is not with the xfrm_state reference via dst_entry, but
the one in skb_ext (skb->extensions) -> sec_path
(skb_ext_get_ptr(skb->extensions, SKB_EXT_SEC_PATH)) -> the xvec
array. But you have a point that I should say that in the commit
message...

And I think you're right that even without those commits a delay in
processing the RCU free callbacks could cause a similar issue, it just
wouldn't be as easy to trigger. That made me look deeper into history
which commit actually added the decrement on free and it turns out it
was done intentionally as a bugfix - see commit e4e8536f65b5
("selinux: fix the labeled xfrm/IPsec reference count handling").
Before that commit the logic was similar to what my patch is doing, so
I could be re-introducing another bug here :-/ The commit message is
not very helpful there - Paul, do you happen to remember what the
issue was that prompted it? I guess there can be some alloc's that
won't have a matching delete in the right circumstances? Or something
involving the selinux_xfrm_policy_clone() case?
Eric Dumazet Nov. 6, 2024, 5 p.m. UTC | #3
On Wed, Nov 6, 2024 at 5:54 PM Ondrej Mosnacek <omosnace@redhat.com> wrote:
>
> On Wed, Nov 6, 2024 at 5:13 PM Eric Dumazet <edumazet@google.com> wrote:
> >
> > On Wed, Nov 6, 2024 at 4:55 PM Ondrej Mosnacek <omosnace@redhat.com> wrote:
> > >
> > > SELinux tracks the number of allocated xfrm_state/xfrm_policy objects
> > > (via the selinux_xfrm_refcount variable) as an input in deciding if peer
> > > labeling should be used.
> > >
> > > However, as a result of commits f35f821935d8 ("tcp: defer skb freeing
> > > after socket lock is released") and 68822bdf76f1 ("net: generalize skb
> > > freeing deferral to per-cpu lists"), freeing of a sk_buff object, which
> > > may hold a reference to an xfrm_state object, can be deferred for
> > > processing on another CPU core, so even after xfrm_state is deleted from
> > > the configuration by userspace, the refcount isn't decremented until the
> > > deferred freeing of relevant sk_buffs happens. On a system with many
> > > cores this can take a very long time (even minutes or more if the system
> > > is not very active), leading to peer labeling being enabled for much
> > > longer than expected.
> > >
> > > Fix this by moving the selinux_xfrm_refcount decrementing to just after
> > > the actual deletion of the xfrm objects rather than waiting for the
> > > freeing to happen. For xfrm_policy it currently doesn't seem to be
> > > necessary, but let's do the same there for consistency and
> > > future-proofing.
> > >
> > > We hit this issue on a specific aarch64 256-core system, where the
> > > sequence of unix_socket/test and inet_socket/tcp/test from
> > > selinux-testsuite [1] would quite reliably trigger this scenario, and a
> > > subsequent sctp/test run would then stumble because the policy for that
> > > test misses some rules that would make it work under peer labeling
> > > enabled (namely it was getting the netif::egress permission denied in
> > > some of the test cases).
> > >
> > > [1] https://github.com/SELinuxProject/selinux-testsuite/
> > >
> > > Fixes: f35f821935d8 ("tcp: defer skb freeing after socket lock is released")
> > > Fixes: 68822bdf76f1 ("net: generalize skb freeing deferral to per-cpu lists")
> > > Signed-off-by: Ondrej Mosnacek <omosnace@redhat.com>
> > > ---
> >
> > Can we explain why TCP packets sitting in TCP receive queues would
> > need to keep xfrm_state around ?
> >
> > With thousands of TCP sockets. I would imagine that a similar issue
> > would be hit,
> > regardless of f35f821935d8 ("tcp: defer skb freeing after socket lock
> > is released") and 68822bdf76f1 ("net: generalize skb freeing deferral
> > to per-cpu lists")
> >
> > We remove the dst from these incoming packets (see skb_dst_drop() in
> > tcp_data_queue() and tcp_add_backlog()),
> > I do not see how XFRM state could be kept ?
>
> The problem is not with the xfrm_state reference via dst_entry, but
> the one in skb_ext (skb->extensions) -> sec_path
> (skb_ext_get_ptr(skb->extensions, SKB_EXT_SEC_PATH)) -> the xvec
> array. But you have a point that I should say that in the commit
> message...
>

So some secpath_reset() calls should be added in various protocol handlers
before packets are possibly queued for hours in a socket queue  ?

I see one in l2tp_eth_dev_recv().

> And I think you're right that even without those commits a delay in
> processing the RCU free callbacks could cause a similar issue, it just
> wouldn't be as easy to trigger. That made me look deeper into history
> which commit actually added the decrement on free and it turns out it
> was done intentionally as a bugfix - see commit e4e8536f65b5
> ("selinux: fix the labeled xfrm/IPsec reference count handling").
> Before that commit the logic was similar to what my patch is doing, so
> I could be re-introducing another bug here :-/ The commit message is
> not very helpful there - Paul, do you happen to remember what the
> issue was that prompted it? I guess there can be some alloc's that
> won't have a matching delete in the right circumstances? Or something
> involving the selinux_xfrm_policy_clone() case?
>
> --
> Ondrej Mosnacek
> Senior Software Engineer, Linux Security - SELinux kernel
> Red Hat, Inc.
>
diff mbox series

Patch

diff --git a/include/linux/lsm_hook_defs.h b/include/linux/lsm_hook_defs.h
index 9eca013aa5e1f..c1f58a74a64ba 100644
--- a/include/linux/lsm_hook_defs.h
+++ b/include/linux/lsm_hook_defs.h
@@ -387,12 +387,14 @@  LSM_HOOK(int, 0, xfrm_policy_clone_security, struct xfrm_sec_ctx *old_ctx,
 LSM_HOOK(void, LSM_RET_VOID, xfrm_policy_free_security,
 	 struct xfrm_sec_ctx *ctx)
 LSM_HOOK(int, 0, xfrm_policy_delete_security, struct xfrm_sec_ctx *ctx)
+LSM_HOOK(void, LSM_RET_VOID, xfrm_policy_deleted, struct xfrm_sec_ctx *ctx)
 LSM_HOOK(int, 0, xfrm_state_alloc, struct xfrm_state *x,
 	 struct xfrm_user_sec_ctx *sec_ctx)
 LSM_HOOK(int, 0, xfrm_state_alloc_acquire, struct xfrm_state *x,
 	 struct xfrm_sec_ctx *polsec, u32 secid)
 LSM_HOOK(void, LSM_RET_VOID, xfrm_state_free_security, struct xfrm_state *x)
 LSM_HOOK(int, 0, xfrm_state_delete_security, struct xfrm_state *x)
+LSM_HOOK(void, LSM_RET_VOID, xfrm_state_deleted, struct xfrm_state *x)
 LSM_HOOK(int, 0, xfrm_policy_lookup, struct xfrm_sec_ctx *ctx, u32 fl_secid)
 LSM_HOOK(int, 1, xfrm_state_pol_flow_match, struct xfrm_state *x,
 	 struct xfrm_policy *xp, const struct flowi_common *flic)
diff --git a/include/linux/security.h b/include/linux/security.h
index b86ec2afc6910..ac1f85d0f1110 100644
--- a/include/linux/security.h
+++ b/include/linux/security.h
@@ -1873,10 +1873,12 @@  int security_xfrm_policy_alloc(struct xfrm_sec_ctx **ctxp,
 int security_xfrm_policy_clone(struct xfrm_sec_ctx *old_ctx, struct xfrm_sec_ctx **new_ctxp);
 void security_xfrm_policy_free(struct xfrm_sec_ctx *ctx);
 int security_xfrm_policy_delete(struct xfrm_sec_ctx *ctx);
+void security_xfrm_policy_deleted(struct xfrm_sec_ctx *ctx);
 int security_xfrm_state_alloc(struct xfrm_state *x, struct xfrm_user_sec_ctx *sec_ctx);
 int security_xfrm_state_alloc_acquire(struct xfrm_state *x,
 				      struct xfrm_sec_ctx *polsec, u32 secid);
 int security_xfrm_state_delete(struct xfrm_state *x);
+void security_xfrm_state_deleted(struct xfrm_state *x);
 void security_xfrm_state_free(struct xfrm_state *x);
 int security_xfrm_policy_lookup(struct xfrm_sec_ctx *ctx, u32 fl_secid);
 int security_xfrm_state_pol_flow_match(struct xfrm_state *x,
@@ -1908,6 +1910,10 @@  static inline int security_xfrm_policy_delete(struct xfrm_sec_ctx *ctx)
 	return 0;
 }
 
+static inline void security_xfrm_policy_deleted(struct xfrm_sec_ctx *ctx)
+{
+}
+
 static inline int security_xfrm_state_alloc(struct xfrm_state *x,
 					struct xfrm_user_sec_ctx *sec_ctx)
 {
@@ -1929,6 +1935,10 @@  static inline int security_xfrm_state_delete(struct xfrm_state *x)
 	return 0;
 }
 
+static inline void security_xfrm_state_deleted(struct xfrm_state *x)
+{
+}
+
 static inline int security_xfrm_policy_lookup(struct xfrm_sec_ctx *ctx, u32 fl_secid)
 {
 	return 0;
diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
index 914bac03b52ad..1433520c62c94 100644
--- a/net/xfrm/xfrm_policy.c
+++ b/net/xfrm/xfrm_policy.c
@@ -2313,6 +2313,7 @@  static struct xfrm_policy *__xfrm_policy_unlink(struct xfrm_policy *pol,
 	list_del_init(&pol->walk.all);
 	net->xfrm.policy_count[dir]--;
 
+	security_xfrm_policy_deleted(pol->security);
 	return pol;
 }
 
diff --git a/net/xfrm/xfrm_state.c b/net/xfrm/xfrm_state.c
index 37478d36a8dff..80f5006bc414e 100644
--- a/net/xfrm/xfrm_state.c
+++ b/net/xfrm/xfrm_state.c
@@ -760,6 +760,8 @@  int __xfrm_state_delete(struct xfrm_state *x)
 
 		xfrm_dev_state_delete(x);
 
+		security_xfrm_state_deleted(x);
+
 		/* All xfrm_state objects are created by xfrm_state_alloc.
 		 * The xfrm_state_alloc call gives a reference, and that
 		 * is what we are dropping here.
diff --git a/security/security.c b/security/security.c
index 6875eb4a59fcc..f6a985417f6f8 100644
--- a/security/security.c
+++ b/security/security.c
@@ -5295,6 +5295,19 @@  int security_xfrm_policy_delete(struct xfrm_sec_ctx *ctx)
 	return call_int_hook(xfrm_policy_delete_security, ctx);
 }
 
+/**
+ * security_xfrm_policy_deleted() - Handle deletion of xfrm policy
+ * @ctx: xfrm security context
+ *
+ * Handle deletion of xfrm policy. This is called on the actual deletion
+ * of the policy from the xfrm system. References to the policy may be
+ * still held elsewhere, so resources should not be freed yet.
+ */
+void security_xfrm_policy_deleted(struct xfrm_sec_ctx *ctx)
+{
+	call_void_hook(xfrm_policy_deleted, ctx);
+}
+
 /**
  * security_xfrm_state_alloc() - Allocate a xfrm state LSM blob
  * @x: xfrm state being added to the SAD
@@ -5345,6 +5358,19 @@  int security_xfrm_state_delete(struct xfrm_state *x)
 }
 EXPORT_SYMBOL(security_xfrm_state_delete);
 
+/**
+ * security_xfrm_state_deleted() - Handle deletion of xfrm state
+ * @x: xfrm state
+ *
+ * Handle deletion of xfrm state. This is called on the actual deletion
+ * of the state from the xfrm system. References to the state may be
+ * still held elsewhere, so resources should not be freed yet.
+ */
+void security_xfrm_state_deleted(struct xfrm_state *x)
+{
+	call_void_hook(xfrm_state_deleted, x);
+}
+
 /**
  * security_xfrm_state_free() - Free a xfrm state
  * @x: xfrm state
diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
index ad3abd48eed1d..d3ade56c09e8f 100644
--- a/security/selinux/hooks.c
+++ b/security/selinux/hooks.c
@@ -7315,8 +7315,10 @@  static struct security_hook_list selinux_hooks[] __ro_after_init = {
 #ifdef CONFIG_SECURITY_NETWORK_XFRM
 	LSM_HOOK_INIT(xfrm_policy_free_security, selinux_xfrm_policy_free),
 	LSM_HOOK_INIT(xfrm_policy_delete_security, selinux_xfrm_policy_delete),
+	LSM_HOOK_INIT(xfrm_policy_deleted, selinux_xfrm_policy_deleted),
 	LSM_HOOK_INIT(xfrm_state_free_security, selinux_xfrm_state_free),
 	LSM_HOOK_INIT(xfrm_state_delete_security, selinux_xfrm_state_delete),
+	LSM_HOOK_INIT(xfrm_state_deleted, selinux_xfrm_state_deleted),
 	LSM_HOOK_INIT(xfrm_policy_lookup, selinux_xfrm_policy_lookup),
 	LSM_HOOK_INIT(xfrm_state_pol_flow_match,
 			selinux_xfrm_state_pol_flow_match),
diff --git a/security/selinux/include/xfrm.h b/security/selinux/include/xfrm.h
index de485556ae29c..bde5a9e2ccf95 100644
--- a/security/selinux/include/xfrm.h
+++ b/security/selinux/include/xfrm.h
@@ -19,12 +19,14 @@  int selinux_xfrm_policy_clone(struct xfrm_sec_ctx *old_ctx,
 			      struct xfrm_sec_ctx **new_ctxp);
 void selinux_xfrm_policy_free(struct xfrm_sec_ctx *ctx);
 int selinux_xfrm_policy_delete(struct xfrm_sec_ctx *ctx);
+void selinux_xfrm_policy_deleted(struct xfrm_sec_ctx *ctx);
 int selinux_xfrm_state_alloc(struct xfrm_state *x,
 			     struct xfrm_user_sec_ctx *uctx);
 int selinux_xfrm_state_alloc_acquire(struct xfrm_state *x,
 				     struct xfrm_sec_ctx *polsec, u32 secid);
 void selinux_xfrm_state_free(struct xfrm_state *x);
 int selinux_xfrm_state_delete(struct xfrm_state *x);
+void selinux_xfrm_state_deleted(struct xfrm_state *x);
 int selinux_xfrm_policy_lookup(struct xfrm_sec_ctx *ctx, u32 fl_secid);
 int selinux_xfrm_state_pol_flow_match(struct xfrm_state *x,
 				      struct xfrm_policy *xp,
diff --git a/security/selinux/xfrm.c b/security/selinux/xfrm.c
index 90ec4ef1b082f..35372bdba7279 100644
--- a/security/selinux/xfrm.c
+++ b/security/selinux/xfrm.c
@@ -124,7 +124,6 @@  static void selinux_xfrm_free(struct xfrm_sec_ctx *ctx)
 	if (!ctx)
 		return;
 
-	atomic_dec(&selinux_xfrm_refcount);
 	kfree(ctx);
 }
 
@@ -321,6 +320,14 @@  int selinux_xfrm_policy_delete(struct xfrm_sec_ctx *ctx)
 	return selinux_xfrm_delete(ctx);
 }
 
+/*
+ * LSM hook implementation that handles deletion of labeled SAs.
+ */
+void selinux_xfrm_policy_deleted(struct xfrm_sec_ctx *ctx)
+{
+	atomic_dec(&selinux_xfrm_refcount);
+}
+
 /*
  * LSM hook implementation that allocates a xfrm_sec_state, populates it using
  * the supplied security context, and assigns it to the xfrm_state.
@@ -389,6 +396,14 @@  int selinux_xfrm_state_delete(struct xfrm_state *x)
 	return selinux_xfrm_delete(x->security);
 }
 
+/*
+ * LSM hook implementation that handles deletion of labeled SAs.
+ */
+void selinux_xfrm_state_deleted(struct xfrm_state *x)
+{
+	atomic_dec(&selinux_xfrm_refcount);
+}
+
 /*
  * LSM hook that controls access to unlabelled packets.  If
  * a xfrm_state is authorizable (defined by macro) then it was