| Message ID | 20220203072508.3072309-6-kishen.maloor@intel.com (mailing list archive) |
|---|---|
| State | Superseded, archived |
| Series | mptcp: fixes and enhancements related to path management |
| Context | Check | Description |
|---|---|---|
| matttbe/checkpatch | success | total: 0 errors, 0 warnings, 0 checks, 107 lines checked |
| matttbe/build | fail | Build error with: -Werror |
| matttbe/KVM_Validation__normal | warning | Unstable: 1 failed test(s): selftest_mptcp_join |
| matttbe/KVM_Validation__debug | warning | Unstable: 2 failed test(s): selftest_diag selftest_mptcp_join |
Kishen Maloor <kishen.maloor@intel.com> wrote:
> The kernel can create listening sockets bound to announced addresses
> via the ADD_ADDR option for receiving MP_JOIN requests. Path
> managers may further choose to advertise the same addr+port over multiple
> MPTCP connections. So this change provides a simple framework to
> manage a list of all distinct listening sockets created in the kernel
> over a namespace by encapsulating the socket in a structure that is
> ref counted and can be shared across multiple connections. The sockets
> are released when there are no more references.

I think it makes sense to work on a hook in the tcp v4/v6 input path
that gets called for the th->syn && !th->ack && no-listener-found case.

The hook would:
1. retrieve the join token, fetch the mptcp_sock and allow the 3whs to
   continue if things look ok from the mptcp p.o.v.
2. return "go ahead and send tcp rst" or "mptcp magic, skb stolen"
   to the tcp stack.

This also makes sure that plain tcp or mptcp connect requests will
not work for addresses that did not go through the socket/bind/listen API.

I will try to prototype something next week.

Given that the hook lives in an error path (from the tcp point of view)
I think it's going to be OK from an upstreaming point of view.

It hopefully avoids the need for "magic listener sockets", and avoids
the kernel fighting with userspace applications over which address:port
pairs are really usable. The latter is a concern IMO, esp. with
reuseport and other round-robin schemes: I don't want the mptcp layer
to interfere with other applications running on the same host.
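The verdict logic of the proposed hook might look roughly like the following userspace sketch. All names here (`mptcp_join_syn_hook`, `token_lookup`, the verdict enum, `fake_tcphdr`) are hypothetical stand-ins, not actual kernel symbols, and the token table is stubbed out; it only illustrates the two outcomes Florian describes.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical return values for the proposed hook. */
enum mptcp_hook_verdict {
	MPTCP_HOOK_SEND_RST,	/* "go ahead and send tcp rst" */
	MPTCP_HOOK_SKB_STOLEN,	/* "mptcp magic, skb stolen" */
};

/* Minimal stand-in for the TCP header flags the hook would inspect. */
struct fake_tcphdr {
	bool syn;
	bool ack;
};

/* Stand-in for a token -> mptcp_sock lookup in the current namespace. */
static bool token_lookup(uint32_t token)
{
	return token == 0xdeadbeefu;	/* pretend one connection exists */
}

/* Called only for th->syn && !th->ack when no TCP listener matched. */
static enum mptcp_hook_verdict
mptcp_join_syn_hook(const struct fake_tcphdr *th, bool has_mpj_option,
		    uint32_t join_token)
{
	/* Not a bare MP_JOIN SYN: nothing for MPTCP to do, send a reset. */
	if (!th->syn || th->ack || !has_mpj_option)
		return MPTCP_HOOK_SEND_RST;

	/* No connection owns this token: also a reset. */
	if (!token_lookup(join_token))
		return MPTCP_HOOK_SEND_RST;

	/* Token matched: let the 3whs continue under MPTCP's control. */
	return MPTCP_HOOK_SKB_STOLEN;
}
```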
On 2/3/22 9:46 AM, Florian Westphal wrote:
> Kishen Maloor <kishen.maloor@intel.com> wrote:
>> The kernel can create listening sockets bound to announced addresses
>> via the ADD_ADDR option for receiving MP_JOIN requests. Path
>> managers may further choose to advertise the same addr+port over multiple
>> MPTCP connections. So this change provides a simple framework to
>> manage a list of all distinct listening sockets created in the kernel
>> over a namespace by encapsulating the socket in a structure that is
>> ref counted and can be shared across multiple connections. The sockets
>> are released when there are no more references.
>
> I think it makes sense to work on a hook in the tcp v4/v6 input path
> that gets called for the th->syn && !th->ack && no-listener-found case.
>
> The hook would:
> 1. retrieve the join token, fetch the mptcp_sock and allow the 3whs to
>    continue if things look ok from the mptcp p.o.v.
> 2. return "go ahead and send tcp rst" or "mptcp magic, skb stolen"
>    to the tcp stack.
>
> This also makes sure that plain tcp or mptcp connect requests will
> not work for addresses that did not go through the socket/bind/listen API.
>
> I will try to prototype something next week.

Sounds good.

> Given that the hook lives in an error path (from the tcp point of view)
> I think it's going to be OK from an upstreaming point of view.
>
> It hopefully avoids the need for "magic listener sockets", and avoids
> the kernel fighting with userspace applications over which address:port
> pairs are really usable.

Will this also obviate the need for the listeners we currently create
for port-based endpoints?

> The latter is a concern IMO, esp. with reuseport and other round-robin
> schemes: I don't want the mptcp layer to interfere with other
> applications running on the same host.
Indeed, if there are active/legacy TCP deployments that cannot be
reconfigured with the NO_LISTEN flag, then we could choose to stick with
the current default behavior and introduce a LISTEN flag (and
additionally a NO_LISTEN flag to not create listeners for port-based
endpoints, as discussed earlier today). Further, if it's possible, we
could also update the MPTCP layer to not accept MPC attempts over
listeners created in the kernel, to address that matter?

My only immediate concern (on the surface) with your proposal (which may
not actually be a worry upon assessment) is any potential risk of
spurious MPJ attempts over what might be a "slow"(?) error path - just
an early thought for consideration.

But look forward to seeing your changes!

Cheers,
-Kishen.
Kishen Maloor <kishen.maloor@intel.com> wrote:
> > Given that the hook lives in an error path (from the tcp point of view)
> > I think it's going to be OK from an upstreaming point of view.
> >
> > It hopefully avoids the need for "magic listener sockets", and avoids
> > the kernel fighting with userspace applications over which address:port
> > pairs are really usable.
>
> Will this also obviate the need for the listeners we currently create
> for port-based endpoints?

Hopefully yes.

> Indeed, if there are active/legacy TCP deployments that cannot be
> reconfigured with the NO_LISTEN flag, then we could choose to stick
> with the current default behavior and introduce a LISTEN flag (and
> additionally a NO_LISTEN flag to not create listeners for port-based
> endpoints, as discussed earlier today). Further, if it's possible, we
> could also update the MPTCP layer to not accept MPC attempts over
> listeners created in the kernel, to address that matter?

Yes, we could do that. I suggest we wait and see how the "syn/join
hook" works out.
On Thu, 3 Feb 2022, Florian Westphal wrote:
> Kishen Maloor <kishen.maloor@intel.com> wrote:
>> The kernel can create listening sockets bound to announced addresses
>> via the ADD_ADDR option for receiving MP_JOIN requests. Path
>> managers may further choose to advertise the same addr+port over multiple
>> MPTCP connections. So this change provides a simple framework to
>> manage a list of all distinct listening sockets created in the kernel
>> over a namespace by encapsulating the socket in a structure that is
>> ref counted and can be shared across multiple connections. The sockets
>> are released when there are no more references.
>
> I think it makes sense to work on a hook in the tcp v4/v6 input path
> that gets called for the th->syn && !th->ack && no-listener-found case.
>
> The hook would:
> 1. retrieve the join token, fetch the mptcp_sock and allow the 3whs to
>    continue if things look ok from the mptcp p.o.v.
> 2. return "go ahead and send tcp rst" or "mptcp magic, skb stolen"
>    to the tcp stack.

This is basically the approach the multipath-tcp.org kernel takes. I
think we initially decided to use listening sockets instead since it was
less invasive, but now we have found the limits of the listener
approach.

It even looks like your proposal adheres more closely to
https://datatracker.ietf.org/doc/html/rfc8684.html#section-3.2 (last
paragraph): "Demultiplexing subflow SYNs MUST be done using the token"

I hadn't noticed that "MUST" before.

Do you think the token lookups should depend on anything other than the
token value and net namespace - for example, to make sure someone can't
use a public interface to try to brute-force potential tokens in use on
other interfaces? The HMACs later in the handshake would guard against
an actual connection (but would burn some extra CPU cycles).
> This also makes sure that plain tcp or mptcp connect requests will
> not work for addresses that did not go through the socket/bind/listen API.
>
> I will try to prototype something next week.

Thanks!

> Given that the hook lives in an error path (from the tcp point of view)
> I think it's going to be OK from an upstreaming point of view.

Seems reasonable to me.

> It hopefully avoids the need for "magic listener sockets", and avoids
> the kernel fighting with userspace applications over which address:port
> pairs are really usable.
>
> The latter is a concern IMO, esp. with reuseport and other round-robin
> schemes: I don't want the mptcp layer to interfere with other
> applications running on the same host.

The complexity of the magic listener sockets - creating and managing
them, as well as mysterious interactions with userspace - does seem more
significant than the TCP changes, so I hope the prototype works out.

--
Mat Martineau
Intel
On Thu, 2022-02-03 at 17:02 -0800, Mat Martineau wrote:
> On Thu, 3 Feb 2022, Florian Westphal wrote:
> > Kishen Maloor <kishen.maloor@intel.com> wrote:
> > > The kernel can create listening sockets bound to announced addresses
> > > via the ADD_ADDR option for receiving MP_JOIN requests. Path
> > > managers may further choose to advertise the same addr+port over multiple
> > > MPTCP connections. So this change provides a simple framework to
> > > manage a list of all distinct listening sockets created in the kernel
> > > over a namespace by encapsulating the socket in a structure that is
> > > ref counted and can be shared across multiple connections. The sockets
> > > are released when there are no more references.
> >
> > I think it makes sense to work on a hook in the tcp v4/v6 input path
> > that gets called for the th->syn && !th->ack && no-listener-found case.
> >
> > The hook would:
> > 1. retrieve the join token, fetch the mptcp_sock and allow the 3whs to
> >    continue if things look ok from the mptcp p.o.v.
> > 2. return "go ahead and send tcp rst" or "mptcp magic, skb stolen"
> >    to the tcp stack.
>
> This is basically the approach the multipath-tcp.org kernel takes. I
> think we initially decided to use listening sockets instead since it
> was less invasive, but now we have found the limits of the listener
> approach.
>
> It even looks like your proposal adheres more closely to
> https://datatracker.ietf.org/doc/html/rfc8684.html#section-3.2 (last
> paragraph): "Demultiplexing subflow SYNs MUST be done using the token"
>
> I hadn't noticed that "MUST" before.
>
> Do you think the token lookups should depend on anything other than the
> token value and net namespace - for example, to make sure someone can't
> use a public interface to try to brute-force potential tokens in use on
> other interfaces? The HMACs later in the handshake would guard against
> an actual connection (but would burn some extra CPU cycles).
> I guess this isn't a fundamentally different problem than we have
> today if there are any MPTCP listeners on public interfaces, it just
> means they don't have to work out the port number first.

The matching MPJ token could/should still be validated against the
(signal) endpoints available in the relevant ns. Attackers should be
able to brute-force only via IPs that are reachable and announced as
signal endpoints. It should be safe IMHO.

Overall LGTM, thanks!

Paolo
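Paolo's suggested check could be sketched as below. This is an illustrative userspace model only: `struct endpoint`, `addr_is_signal_endpoint`, and `accept_join_syn` are hypothetical names (real endpoints live in the pernet local address list), and addresses are modeled as strings for simplicity. The point is just that a token match alone is not enough; the destination address must also be an announced signal endpoint.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical model of a PM endpoint in the namespace. */
struct endpoint {
	const char *addr;
	bool signal;	/* announced to peers via ADD_ADDR */
};

/* Accept MP_JOIN SYNs only on addresses that were actually announced. */
static bool addr_is_signal_endpoint(const struct endpoint *eps, size_t n,
				    const char *dst_addr)
{
	for (size_t i = 0; i < n; i++)
		if (eps[i].signal && strcmp(eps[i].addr, dst_addr) == 0)
			return true;
	return false;
}

/* Combined check: valid token AND the SYN arrived on a signal endpoint. */
static bool accept_join_syn(bool token_matched, const struct endpoint *eps,
			    size_t n, const char *dst_addr)
{
	return token_matched && addr_is_signal_endpoint(eps, n, dst_addr);
}
```

With this shape, a brute-force attempt against a public, non-announced interface fails the endpoint check even if the attacker guessed a live token.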
diff --git a/net/mptcp/pm_netlink.c b/net/mptcp/pm_netlink.c
index ff13012178ae..3d6251baef26 100644
--- a/net/mptcp/pm_netlink.c
+++ b/net/mptcp/pm_netlink.c
@@ -22,6 +22,14 @@ static struct genl_family mptcp_genl_family;

 static int pm_nl_pernet_id;

+struct mptcp_local_lsk {
+	struct list_head	list;
+	struct mptcp_addr_info	addr;
+	struct socket		*lsk;
+	struct rcu_head		rcu;
+	refcount_t		refcount;
+};
+
 struct mptcp_pm_addr_entry {
 	struct list_head	list;
 	struct mptcp_addr_info	addr;
@@ -41,7 +49,10 @@ struct mptcp_pm_add_entry {
 struct pm_nl_pernet {
 	/* protects pernet updates */
 	spinlock_t		lock;
+	/* protects access to pernet lsk list */
+	spinlock_t		lsk_list_lock;
 	struct list_head	local_addr_list;
+	struct list_head	lsk_list;
 	unsigned int		addrs;
 	unsigned int		stale_loss_cnt;
 	unsigned int		add_addr_signal_max;
@@ -83,6 +94,69 @@ static bool addresses_equal(const struct mptcp_addr_info *a,
 	return a->port == b->port;
 }

+static struct mptcp_local_lsk *lsk_list_find(struct pm_nl_pernet *pernet,
+					     struct mptcp_addr_info *addr)
+{
+	struct mptcp_local_lsk *lsk_ref = NULL;
+	struct mptcp_local_lsk *i;
+
+	rcu_read_lock();
+
+	list_for_each_entry_rcu(i, &pernet->lsk_list, list) {
+		if (addresses_equal(&i->addr, addr, true)) {
+			if (refcount_inc_not_zero(&i->refcount)) {
+				lsk_ref = i;
+				break;
+			}
+		}
+	}
+
+	rcu_read_unlock();
+
+	return lsk_ref;
+}
+
+static void lsk_list_add_ref(struct mptcp_local_lsk *lsk_ref)
+{
+	refcount_inc(&lsk_ref->refcount);
+}
+
+static struct mptcp_local_lsk *lsk_list_add(struct pm_nl_pernet *pernet,
+					    struct mptcp_addr_info *addr,
+					    struct socket *lsk)
+{
+	struct mptcp_local_lsk *lsk_ref;
+
+	lsk_ref = kmalloc(sizeof(*lsk_ref), GFP_ATOMIC);
+
+	if (!lsk_ref)
+		return NULL;
+
+	lsk_ref->lsk = lsk;
+	memcpy(&lsk_ref->addr, addr, sizeof(struct mptcp_addr_info));
+	refcount_set(&lsk_ref->refcount, 1);
+
+	spin_lock_bh(&pernet->lsk_list_lock);
+	list_add_rcu(&lsk_ref->list, &pernet->lsk_list);
+	spin_unlock_bh(&pernet->lsk_list_lock);
+
+	return lsk_ref;
+}
+
+static void lsk_list_release(struct pm_nl_pernet *pernet,
+			     struct mptcp_local_lsk *lsk_ref)
+{
+	if (lsk_ref && refcount_dec_and_test(&lsk_ref->refcount)) {
+		sock_release(lsk_ref->lsk);
+
+		spin_lock_bh(&pernet->lsk_list_lock);
+		list_del_rcu(&lsk_ref->list);
+		spin_unlock_bh(&pernet->lsk_list_lock);
+
+		kfree_rcu(lsk_ref, rcu);
+	}
+}
+
 static bool address_zero(const struct mptcp_addr_info *addr)
 {
 	struct mptcp_addr_info zero;
@@ -2141,12 +2215,14 @@ static int __net_init pm_nl_init_net(struct net *net)
 	struct pm_nl_pernet *pernet = net_generic(net, pm_nl_pernet_id);

 	INIT_LIST_HEAD_RCU(&pernet->local_addr_list);
+	INIT_LIST_HEAD_RCU(&pernet->lsk_list);

 	/* Cit. 2 subflows ought to be enough for anybody. */
 	pernet->subflows_max = 2;
 	pernet->next_id = 1;
 	pernet->stale_loss_cnt = 4;
 	spin_lock_init(&pernet->lock);
+	spin_lock_init(&pernet->lsk_list_lock);

 	/* No need to initialize other pernet fields, the struct is zeroed at
 	 * allocation time.
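The lifecycle of the patch's refcounted listener list can be modeled with the following userspace analogue. It deliberately replaces the kernel pieces (RCU-protected list walk, `spinlock`, `refcount_t`, `sock_release`) with a plain singly linked list and an `int` refcount, so all names and the locking-free structure here are illustrative, not the patch's actual code; only the find-or-share-or-free semantics carry over.

```c
#include <stdlib.h>
#include <string.h>

/* Userspace analogue of struct mptcp_local_lsk: one entry per distinct
 * announced address, shared across connections, freed on last put. */
struct lsk_entry {
	struct lsk_entry *next;
	char addr[32];
	int refcount;
};

static struct lsk_entry *lsk_list;	/* models the pernet lsk_list */

/* Find an existing entry (like lsk_list_find + lsk_list_add_ref) or
 * create a new one (like lsk_list_add). */
static struct lsk_entry *lsk_get(const char *addr)
{
	struct lsk_entry *e;

	for (e = lsk_list; e; e = e->next) {
		if (strcmp(e->addr, addr) == 0) {
			e->refcount++;
			return e;
		}
	}

	e = calloc(1, sizeof(*e));
	if (!e)
		return NULL;
	strncpy(e->addr, addr, sizeof(e->addr) - 1);
	e->refcount = 1;
	e->next = lsk_list;
	lsk_list = e;
	return e;
}

/* Drop a reference; unlink and free on the last one, like
 * lsk_list_release. */
static void lsk_put(struct lsk_entry *e)
{
	struct lsk_entry **p;

	if (--e->refcount)
		return;

	for (p = &lsk_list; *p; p = &(*p)->next) {
		if (*p == e) {
			*p = e->next;
			break;
		}
	}
	free(e);
}
```

Two connections announcing the same addr+port thus share one entry with refcount 2, and the listener is released only when the second connection drops its reference.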
The kernel can create listening sockets bound to announced addresses
via the ADD_ADDR option for receiving MP_JOIN requests. Path managers
may further choose to advertise the same addr+port over multiple MPTCP
connections. So this change provides a simple framework to manage a
list of all distinct listening sockets created in the kernel over a
namespace, by encapsulating each socket in a structure that is
refcounted and can be shared across multiple connections. The sockets
are released when there are no more references.

Signed-off-by: Kishen Maloor <kishen.maloor@intel.com>
---
v2: fixed formatting
---
 net/mptcp/pm_netlink.c | 76 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 76 insertions(+)