diff mbox series

[mptcp-next,v5,5/8] mptcp: netlink: store per namespace list of refcounted listen socks

Message ID 20220203072508.3072309-6-kishen.maloor@intel.com (mailing list archive)
State Superseded, archived
Headers show
Series mptcp: fixes and enhancements related to path management | expand

Checks

Context Check Description
matttbe/checkpatch success total: 0 errors, 0 warnings, 0 checks, 107 lines checked
matttbe/build fail Build error with: -Werror
matttbe/KVM_Validation__normal warning Unstable: 1 failed test(s): selftest_mptcp_join
matttbe/KVM_Validation__debug warning Unstable: 2 failed test(s): selftest_diag selftest_mptcp_join

Commit Message

Kishen Maloor Feb. 3, 2022, 7:25 a.m. UTC
The kernel can create listening sockets bound to announced addresses
via the ADD_ADDR option for receiving MP_JOIN requests. Path
managers may further choose to advertise the same addr+port over multiple
MPTCP connections. So this change provides a simple framework to
manage a list of all distinct listening sockets created in the kernel
within a namespace by encapsulating each socket in a structure that is
refcounted and can be shared across multiple connections. The sockets
are released when there are no more references.

Signed-off-by: Kishen Maloor <kishen.maloor@intel.com>
---
v2: fixed formatting
---
 net/mptcp/pm_netlink.c | 76 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 76 insertions(+)

Comments

Florian Westphal Feb. 3, 2022, 5:46 p.m. UTC | #1
Kishen Maloor <kishen.maloor@intel.com> wrote:
> The kernel can create listening sockets bound to announced addresses
> via the ADD_ADDR option for receiving MP_JOIN requests. Path
> managers may further choose to advertise the same addr+port over multiple
> MPTCP connections. So this change provides a simple framework to
> manage a list of all distinct listening sockets created in the kernel
> over a namespace by encapsulating the socket in a structure that is
> ref counted and can be shared across multiple connections. The sockets
> are released when there are no more references.

I think it makes sense to work on a hook in the tcp v4/v6 input path
that gets called for the th->syn && !th->ack && no-listener-found case.

The hook would:
1. retrieve join token, fetch mptcp_sock and allow 3whs to continue
   if things look ok from mptcp p.o.v.
2. return "go ahead and send tcp rst" or "mptcp magic, skb stolen"
to the tcp stack.

This also makes sure that plain tcp or mptcp connect requests will
not work for addresses that did not go through socket/bind/listen API.

I will try to prototype something next week.

Given that the hook lives in an error path (from the tcp point of view)
I think it's going to be OK from an upstreaming point of view.

It hopefully avoids the need for "magic listener sockets", and avoids
kernel fighting with userspace applications over which address:port
pairs are really usable.

The latter is a concern IMO, esp. with reuseport and other round-robin
schemes; I don't want the mptcp layer to interfere with other applications
running on the same host.
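As a rough userspace sketch of the dispatch logic described above — everything here (the verdict enum, `mptcp_join_hook()`, `mptcp_token_lookup()`, the token value) is a hypothetical stand-in, not kernel API:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical verdicts the hook could hand back to the tcp stack. */
enum hook_verdict {
	HOOK_SEND_RST,   /* nothing matched: let tcp send a reset   */
	HOOK_SKB_STOLEN, /* join token matched: mptcp owns the 3whs */
};

/* Minimal stand-ins; neither is the kernel's real definition. */
struct tcp_flags { bool syn, ack; };
struct mptcp_sock { int unused; };

/* Stand-in for a per-netns token table (assumed helper, not kernel API). */
static struct mptcp_sock *mptcp_token_lookup(unsigned int token)
{
	static struct mptcp_sock known;

	return token == 0xfeedu ? &known : NULL;
}

/* The hook only fires for a bare SYN when no listener was found. */
static enum hook_verdict mptcp_join_hook(const struct tcp_flags *th,
					 unsigned int join_token)
{
	if (!th->syn || th->ack)
		return HOOK_SEND_RST;	/* not a bare SYN */
	if (!mptcp_token_lookup(join_token))
		return HOOK_SEND_RST;	/* unknown token: plain tcp reset */
	return HOOK_SKB_STOLEN;		/* continue the mptcp 3whs */
}
```

A bare SYN carrying a known token is "stolen"; anything else falls back to the normal tcp reset path.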
Kishen Maloor Feb. 3, 2022, 8:09 p.m. UTC | #2
On 2/3/22 9:46 AM, Florian Westphal wrote:
> Kishen Maloor <kishen.maloor@intel.com> wrote:
>> The kernel can create listening sockets bound to announced addresses
>> via the ADD_ADDR option for receiving MP_JOIN requests. Path
>> managers may further choose to advertise the same addr+port over multiple
>> MPTCP connections. So this change provides a simple framework to
>> manage a list of all distinct listening sockets created in the kernel
>> over a namespace by encapsulating the socket in a structure that is
>> ref counted and can be shared across multiple connections. The sockets
>> are released when there are no more references.
> 
> I think it makes sense to work on a hook in tcp v4/v6 input path
> that gets called for th->syn && !th->ack && no-listener-found case.
> 
> The hook would:
> 1. retrieve join token, fetch mptcp_sock and allow 3whs to continue
>    if things look ok from mptcp p.o.v.
> 2. return "go ahead and send tcp rst" or "mptcp magic, skb stolen"
> to the tcp stack.
> 
> This also makes sure that plain tcp or mptcp connect requests will
> not work for addresses that did not go through socket/bind/listen API.
> 
> I will try to prototype something next week.
> 

Sounds good.

> Given that the hook lives in an error path (from the tcp point of view)
> I think it's going to be OK from an upstreaming point of view.
> 
> It hopefully avoids the need for "magic listener sockets", and avoids
> kernel fighting with userspace applications over which address:port
> pairs are really usable.
> 

Will this also obviate the need for listeners we currently create for port-based
endpoints?

> The latter is a concern IMO, esp. with reuseport and other round-robin
> schemes; I don't want the mptcp layer to interfere with other applications
> running on the same host.

Indeed if there are active/legacy TCP deployments that cannot be reconfigured with the 
NO_LISTEN flag, then we could choose to stick with the current default behavior
and introduce a LISTEN flag (and additionally a NO_LISTEN flag to not create listeners for
port-based endpoints as discussed earlier today). Further, if it's possible, we could 
also update the MPTCP layer to not accept MPC attempts over listeners created in the 
kernel to address that matter?

My only immediate concern with your proposal (which may not actually
be a worry upon assessment) is any potential risk of spurious MPJ
attempts over what might be a "slow"(?) error path - just an early thought for
consideration. But I look forward to seeing your changes!

Cheers,
-Kishen.
Florian Westphal Feb. 3, 2022, 8:35 p.m. UTC | #3
Kishen Maloor <kishen.maloor@intel.com> wrote:
> > Given that the hook lives in an error path (from the tcp point of view)
> > I think it's going to be OK from an upstreaming point of view.
> > 
> > It hopefully avoids the need for "magic listener sockets", and avoids
> > kernel fighting with userspace applications over which address:port
> > pairs are really usable.
> > 
> 
> Will this also obviate the need for listeners we currently create for port-based
> endpoints?

Hopefully yes.

> Indeed if there are active/legacy TCP deployments that cannot be reconfigured with the 
> NO_LISTEN flag, then we could choose to stick with the current default behavior
> and introduce a LISTEN flag (and additionally a NO_LISTEN flag to not create listeners for
> port-based endpoints as discussed earlier today). Further, if it's possible, we could 
> also update the MPTCP layer to not accept MPC attempts over listeners created in the 
> kernel to address that matter?

Yes, we could do that; I suggest we wait and see how the "syn/join hook"
works out.
Mat Martineau Feb. 4, 2022, 1:02 a.m. UTC | #4
On Thu, 3 Feb 2022, Florian Westphal wrote:

> Kishen Maloor <kishen.maloor@intel.com> wrote:
>> The kernel can create listening sockets bound to announced addresses
>> via the ADD_ADDR option for receiving MP_JOIN requests. Path
>> managers may further choose to advertise the same addr+port over multiple
>> MPTCP connections. So this change provides a simple framework to
>> manage a list of all distinct listening sockets created in the kernel
>> over a namespace by encapsulating the socket in a structure that is
>> ref counted and can be shared across multiple connections. The sockets
>> are released when there are no more references.
>
> I think it makes sense to work on a hook in tcp v4/v6 input path
> that gets called for th->syn && !th->ack && no-listener-found case.
>
> The hook would:
> 1. retrieve join token, fetch mptcp_sock and allow 3whs to continue
>   if things look ok from mptcp p.o.v.
> 2. return "go ahead and send tcp rst" or "mptcp magic, skb stolen"
> to the tcp stack.
>

This is basically the approach the multipath-tcp.org kernel takes. I think 
we initially decided to use listening sockets instead since it was less 
invasive, but now we have found the limits of the listener approach.

It even looks like your proposal adheres more closely to 
https://datatracker.ietf.org/doc/html/rfc8684.html#section-3.2 (last 
paragraph): "Demultiplexing subflow SYNs MUST be done using the token"

I hadn't noticed that "MUST" before.

Do you think the token lookups should depend on anything other than the 
token value and net namespace - for example, to make sure someone can't 
use a public interface to try to brute-force potential tokens in use on 
other interfaces? The HMACs later in the handshake would guard against an 
actual connection (but would burn some extra CPU cycles). I guess this 
isn't a fundamentally different problem than we have today if there are 
any MPTCP listeners on public interfaces, it just means they don't have to 
work out the port number first.
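The lookup being discussed — keyed on nothing but (namespace, token) — can be sketched in plain C; the structures and `token_lookup()` helper here are toy stand-ins, not the kernel's hashed, RCU-protected token set:

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins; none of these are the kernel's real structures. */
struct net { int id; };

struct token_entry {
	const struct net *net;
	unsigned int      token;
};

static const struct net ns1 = { 1 };

/* A toy table; the kernel keeps tokens in a hashed, RCU-protected set. */
static const struct token_entry table[] = {
	{ &ns1, 0xabcdu },
};

/* Lookup keyed purely on (netns, token), matching RFC 8684 section 3.2:
 * "Demultiplexing subflow SYNs MUST be done using the token" -- no
 * address or interface is consulted here. */
static const struct token_entry *token_lookup(const struct net *net,
					      unsigned int token)
{
	size_t i;

	for (i = 0; i < sizeof(table) / sizeof(table[0]); i++)
		if (table[i].net == net && table[i].token == token)
			return &table[i];
	return NULL;
}
```

The brute-force concern above amounts to asking whether this key should grow a third component (e.g. the destination address).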

> This also makes sure that plain tcp or mptcp connect requests will
> not work for addresses that did not go through socket/bind/listen API.
>
> I will try to prototype something next week.
>

Thanks!

> Given that the hook lives in an error path (from the tcp point of view)
> I think it's going to be OK from an upstreaming point of view.
>

Seems reasonable to me.

> It hopefully avoids the need for "magic listener sockets", and avoids
> kernel fighting with userspace applications over which address:port
> pairs are really usable.
>
> The latter is a concern IMO, esp. with reuseport and other round-robin
> schemes; I don't want the mptcp layer to interfere with other applications
> running on the same host.

The complexity of the magic listener sockets - creating and managing them, 
as well as mysterious interactions with userspace - does seem more 
significant than the TCP changes, so I hope the prototype works out.


--
Mat Martineau
Intel
Paolo Abeni Feb. 4, 2022, 9:47 a.m. UTC | #5
On Thu, 2022-02-03 at 17:02 -0800, Mat Martineau wrote:
> On Thu, 3 Feb 2022, Florian Westphal wrote:
> 
> > Kishen Maloor <kishen.maloor@intel.com> wrote:
> > > The kernel can create listening sockets bound to announced addresses
> > > via the ADD_ADDR option for receiving MP_JOIN requests. Path
> > > managers may further choose to advertise the same addr+port over multiple
> > > MPTCP connections. So this change provides a simple framework to
> > > manage a list of all distinct listening sockets created in the kernel
> > > over a namespace by encapsulating the socket in a structure that is
> > > ref counted and can be shared across multiple connections. The sockets
> > > are released when there are no more references.
> > 
> > I think it makes sense to work on a hook in tcp v4/v6 input path
> > that gets called for th->syn && !th->ack && no-listener-found case.
> > 
> > The hook would:
> > 1. retrieve join token, fetch mptcp_sock and allow 3whs to continue
> >   if things look ok from mptcp p.o.v.
> > 2. return "go ahead and send tcp rst" or "mptcp magic, skb stolen"
> > to the tcp stack.
> > 
> 
> This is basically the approach the multipath-tcp.org kernel takes. I think 
> we initially decided to use listening sockets instead since it was less 
> invasive, but now we have found the limits of the listener approach.
> 
> It even looks like your proposal adheres more closely to 
> https://datatracker.ietf.org/doc/html/rfc8684.html#section-3.2 (last 
> paragraph): "Demultiplexing subflow SYNs MUST be done using the token"
> 
> I hadn't noticed that "MUST" before.
> 
> Do you think the token lookups should depend on anything other than the 
> token value and net namespace - for example, to make sure someone can't 
> use a public interface to try to brute-force potential tokens in use on 
> other interfaces? The HMACs later in the handshake would guard against an 
> actual connection (but would burn some extra CPU cycles). I guess this 
> isn't a fundamentally different problem than we have today if there are 
> any MPTCP listeners on public interfaces, it just means they don't have to 
> work out the port number first.

The matching MPJ token could/should still be validated vs the (signal)
endpoints available in the relevant ns. Attackers should be able to
brute-force only via IPs that are reachable and announced as signal
endpoints. It should be safe IMHO.

Overall LGTM

Thanks!

Paolo
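The extra validation Paolo suggests — accepting a token match only when the SYN's destination was actually announced — could look roughly like this; the `struct addr`, the table, and `dst_is_signal_endpoint()` are illustrative stand-ins, not the kernel's `mptcp_addr_info` handling:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy address type; not the kernel's mptcp_addr_info. */
struct addr { unsigned int ip; unsigned short port; };

/* Addresses this namespace has announced as signal endpoints (toy data). */
static const struct addr signal_endpoints[] = {
	{ 0x0a000001u, 8080 },	/* 10.0.0.1:8080 */
};

/* After a token match, additionally require that the SYN's destination
 * was announced as a signal endpoint, limiting brute-force attempts to
 * reachable, announced addresses. */
static bool dst_is_signal_endpoint(const struct addr *dst)
{
	size_t i, n = sizeof(signal_endpoints) / sizeof(signal_endpoints[0]);

	for (i = 0; i < n; i++)
		if (signal_endpoints[i].ip == dst->ip &&
		    signal_endpoints[i].port == dst->port)
			return true;
	return false;
}
```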
diff mbox series

Patch

diff --git a/net/mptcp/pm_netlink.c b/net/mptcp/pm_netlink.c
index ff13012178ae..3d6251baef26 100644
--- a/net/mptcp/pm_netlink.c
+++ b/net/mptcp/pm_netlink.c
@@ -22,6 +22,14 @@  static struct genl_family mptcp_genl_family;
 
 static int pm_nl_pernet_id;
 
+struct mptcp_local_lsk {
+	struct list_head	list;
+	struct mptcp_addr_info	addr;
+	struct socket		*lsk;
+	struct rcu_head		rcu;
+	refcount_t		refcount;
+};
+
 struct mptcp_pm_addr_entry {
 	struct list_head	list;
 	struct mptcp_addr_info	addr;
@@ -41,7 +49,10 @@  struct mptcp_pm_add_entry {
 struct pm_nl_pernet {
 	/* protects pernet updates */
 	spinlock_t		lock;
+	/* protects access to pernet lsk list */
+	spinlock_t		lsk_list_lock;
 	struct list_head	local_addr_list;
+	struct list_head	lsk_list;
 	unsigned int		addrs;
 	unsigned int		stale_loss_cnt;
 	unsigned int		add_addr_signal_max;
@@ -83,6 +94,69 @@  static bool addresses_equal(const struct mptcp_addr_info *a,
 	return a->port == b->port;
 }
 
+static struct mptcp_local_lsk *lsk_list_find(struct pm_nl_pernet *pernet,
+					     struct mptcp_addr_info *addr)
+{
+	struct mptcp_local_lsk *lsk_ref = NULL;
+	struct mptcp_local_lsk *i;
+
+	rcu_read_lock();
+
+	list_for_each_entry_rcu(i, &pernet->lsk_list, list) {
+		if (addresses_equal(&i->addr, addr, true)) {
+			if (refcount_inc_not_zero(&i->refcount)) {
+				lsk_ref = i;
+				break;
+			}
+		}
+	}
+
+	rcu_read_unlock();
+
+	return lsk_ref;
+}
+
+static void lsk_list_add_ref(struct mptcp_local_lsk *lsk_ref)
+{
+	refcount_inc(&lsk_ref->refcount);
+}
+
+static struct mptcp_local_lsk *lsk_list_add(struct pm_nl_pernet *pernet,
+					    struct mptcp_addr_info *addr,
+					    struct socket *lsk)
+{
+	struct mptcp_local_lsk *lsk_ref;
+
+	lsk_ref = kmalloc(sizeof(*lsk_ref), GFP_ATOMIC);
+
+	if (!lsk_ref)
+		return NULL;
+
+	lsk_ref->lsk = lsk;
+	memcpy(&lsk_ref->addr, addr, sizeof(struct mptcp_addr_info));
+	refcount_set(&lsk_ref->refcount, 1);
+
+	spin_lock_bh(&pernet->lsk_list_lock);
+	list_add_rcu(&lsk_ref->list, &pernet->lsk_list);
+	spin_unlock_bh(&pernet->lsk_list_lock);
+
+	return lsk_ref;
+}
+
+static void lsk_list_release(struct pm_nl_pernet *pernet,
+			     struct mptcp_local_lsk *lsk_ref)
+{
+	if (lsk_ref && refcount_dec_and_test(&lsk_ref->refcount)) {
+		sock_release(lsk_ref->lsk);
+
+		spin_lock_bh(&pernet->lsk_list_lock);
+		list_del_rcu(&lsk_ref->list);
+		spin_unlock_bh(&pernet->lsk_list_lock);
+
+		kfree_rcu(lsk_ref, rcu);
+	}
+}
+
 static bool address_zero(const struct mptcp_addr_info *addr)
 {
 	struct mptcp_addr_info zero;
@@ -2141,12 +2215,14 @@  static int __net_init pm_nl_init_net(struct net *net)
 	struct pm_nl_pernet *pernet = net_generic(net, pm_nl_pernet_id);
 
 	INIT_LIST_HEAD_RCU(&pernet->local_addr_list);
+	INIT_LIST_HEAD_RCU(&pernet->lsk_list);
 
 	/* Cit. 2 subflows ought to be enough for anybody. */
 	pernet->subflows_max = 2;
 	pernet->next_id = 1;
 	pernet->stale_loss_cnt = 4;
 	spin_lock_init(&pernet->lock);
+	spin_lock_init(&pernet->lsk_list_lock);
 
 	/* No need to initialize other pernet fields, the struct is zeroed at
 	 * allocation time.
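The refcount discipline the patch relies on — `refcount_inc_not_zero()` in `lsk_list_find()` to skip entries whose release is in flight, and `refcount_dec_and_test()` in `lsk_list_release()` to detect the final put — can be mirrored in userspace with C11 atomics; this is an illustration of the technique, not the kernel's `refcount_t` implementation:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Userspace mirror of refcount_inc_not_zero(): take a reference only if
 * the object is still live; a count of zero means a release is in flight
 * and the (RCU-visible) list entry must be skipped, not resurrected. */
static bool ref_get_not_zero(atomic_int *ref)
{
	int old = atomic_load(ref);

	while (old != 0)
		/* on failure, old is reloaded with the current value */
		if (atomic_compare_exchange_weak(ref, &old, old + 1))
			return true;
	return false;
}

/* Userspace mirror of refcount_dec_and_test(): drop a reference and
 * report whether the caller dropped the last one and must free. */
static bool ref_put_and_test(atomic_int *ref)
{
	return atomic_fetch_sub(ref, 1) == 1;
}
```

This is why the lookup can run under `rcu_read_lock()` without the spinlock: a concurrent final put simply makes `ref_get_not_zero()` fail, and the entry is freed via `kfree_rcu()` after readers drain.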