diff mbox series

[net] l2tp: Don't sleep and disable BH under writer-side sk_callback_lock

Message ID 20221119130317.39158-1-jakub@cloudflare.com (mailing list archive)
State Superseded
Delegated to: Netdev Maintainers
Headers show
Series [net] l2tp: Don't sleep and disable BH under writer-side sk_callback_lock | expand

Checks

Context Check Description
netdev/tree_selection success Clearly marked for net
netdev/fixes_present success Fixes tag present in non-next series
netdev/subject_prefix success Link
netdev/cover_letter success Single patches do not need cover letters
netdev/patch_count success Link
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 0 this patch: 0
netdev/cc_maintainers success CCed 7 of 7 maintainers
netdev/build_clang success Errors and warnings before: 0 this patch: 0
netdev/module_param success Was 0 now: 0
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success Fixes tag looks correct
netdev/build_allmodconfig_warn success Errors and warnings before: 0 this patch: 0
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 52 lines checked
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0

Commit Message

Jakub Sitnicki Nov. 19, 2022, 1:03 p.m. UTC
When holding a reader-writer spin lock we cannot sleep. Calling
setup_udp_tunnel_sock() with write lock held violates this rule, because we
end up calling percpu_down_read(), which might sleep, as syzbot reports
[1]:

 __might_resched.cold+0x222/0x26b kernel/sched/core.c:9890
 percpu_down_read include/linux/percpu-rwsem.h:49 [inline]
 cpus_read_lock+0x1b/0x140 kernel/cpu.c:310
 static_key_slow_inc+0x12/0x20 kernel/jump_label.c:158
 udp_tunnel_encap_enable include/net/udp_tunnel.h:187 [inline]
 setup_udp_tunnel_sock+0x43d/0x550 net/ipv4/udp_tunnel_core.c:81
 l2tp_tunnel_register+0xc51/0x1210 net/l2tp/l2tp_core.c:1509
 pppol2tp_connect+0xcdc/0x1a10 net/l2tp/l2tp_ppp.c:723

Trim the writer-side critical section for sk_callback_lock down to the
minimum, so that it covers only operations on sk_user_data.

Also, when grabbing the sk_callback_lock, we always need to disable BH, as
Eric points out. Failing to do so leads to deadlocks because we acquire
sk_callback_lock in softirq context, which can get stuck waiting on us if:

1) it runs on the same CPU, or

       CPU0
       ----
  lock(clock-AF_INET6);
  <Interrupt>
    lock(clock-AF_INET6);

2) lock ordering leads to priority inversion

       CPU0                    CPU1
       ----                    ----
  lock(clock-AF_INET6);
                               local_irq_disable();
                               lock(&tcp_hashinfo.bhash[i].lock);
                               lock(clock-AF_INET6);
  <Interrupt>
    lock(&tcp_hashinfo.bhash[i].lock);

... as syzbot reports [2,3]. Use the _bh variants for write_(un)lock.

[1] https://lore.kernel.org/netdev/0000000000004e78ec05eda79749@google.com/
[2] https://lore.kernel.org/netdev/000000000000e38b6605eda76f98@google.com/
[3] https://lore.kernel.org/netdev/000000000000dfa31e05eda76f75@google.com/

Cc: Tom Parkin <tparkin@katalix.com>
Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Fixes: b68777d54fac ("l2tp: Serialize access to sk_user_data with sk_callback_lock")
Reported-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot+703d9e154b3b58277261@syzkaller.appspotmail.com
Reported-by: syzbot+50680ced9e98a61f7698@syzkaller.appspotmail.com
Reported-by: syzbot+de987172bb74a381879b@syzkaller.appspotmail.com
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
---
 net/l2tp/l2tp_core.c | 25 +++++++++++++++++--------
 1 file changed, 17 insertions(+), 8 deletions(-)

Comments

Tetsuo Handa Nov. 19, 2022, 1:52 p.m. UTC | #1
On 2022/11/19 22:03, Jakub Sitnicki wrote:
> When holding a reader-writer spin lock we cannot sleep. Calling
> setup_udp_tunnel_sock() with write lock held violates this rule, because we
> end up calling percpu_down_read(), which might sleep, as syzbot reports
> [1]:
> 
>  __might_resched.cold+0x222/0x26b kernel/sched/core.c:9890
>  percpu_down_read include/linux/percpu-rwsem.h:49 [inline]
>  cpus_read_lock+0x1b/0x140 kernel/cpu.c:310
>  static_key_slow_inc+0x12/0x20 kernel/jump_label.c:158
>  udp_tunnel_encap_enable include/net/udp_tunnel.h:187 [inline]
>  setup_udp_tunnel_sock+0x43d/0x550 net/ipv4/udp_tunnel_core.c:81
>  l2tp_tunnel_register+0xc51/0x1210 net/l2tp/l2tp_core.c:1509
>  pppol2tp_connect+0xcdc/0x1a10 net/l2tp/l2tp_ppp.c:723
> 
> Trim the writer-side critical section for sk_callback_lock down to the
> minimum, so that it covers only operations on sk_user_data.

This patch does not look correct.

Since l2tp_validate_socket() checks that sk->sk_user_data == NULL with
sk->sk_callback_lock held, you need to call rcu_assign_sk_user_data(sk, tunnel)
before releasing sk->sk_callback_lock.
Tetsuo Handa Nov. 19, 2022, 2:27 p.m. UTC | #2
On 2022/11/19 22:52, Tetsuo Handa wrote:
> On 2022/11/19 22:03, Jakub Sitnicki wrote:
>> When holding a reader-writer spin lock we cannot sleep. Calling
>> setup_udp_tunnel_sock() with write lock held violates this rule, because we
>> end up calling percpu_down_read(), which might sleep, as syzbot reports
>> [1]:
>>
>>  __might_resched.cold+0x222/0x26b kernel/sched/core.c:9890
>>  percpu_down_read include/linux/percpu-rwsem.h:49 [inline]
>>  cpus_read_lock+0x1b/0x140 kernel/cpu.c:310
>>  static_key_slow_inc+0x12/0x20 kernel/jump_label.c:158
>>  udp_tunnel_encap_enable include/net/udp_tunnel.h:187 [inline]
>>  setup_udp_tunnel_sock+0x43d/0x550 net/ipv4/udp_tunnel_core.c:81
>>  l2tp_tunnel_register+0xc51/0x1210 net/l2tp/l2tp_core.c:1509
>>  pppol2tp_connect+0xcdc/0x1a10 net/l2tp/l2tp_ppp.c:723
>>
>> Trim the writer-side critical section for sk_callback_lock down to the
>> minimum, so that it covers only operations on sk_user_data.
> 
> This patch does not look correct.
> 
> Since l2tp_validate_socket() checks that sk->sk_user_data == NULL with
> sk->sk_callback_lock held, you need to call rcu_assign_sk_user_data(sk, tunnel)
> before releasing sk->sk_callback_lock.
> 

Is it safe to temporarily set a dummy pointer like below?
If it is not safe, what makes assignments done by setup_udp_tunnel_sock() safe?

diff --git a/net/l2tp/l2tp_core.c b/net/l2tp/l2tp_core.c
index 754fdda8a5f5..198d38d8fceb 100644
--- a/net/l2tp/l2tp_core.c
+++ b/net/l2tp/l2tp_core.c
@@ -1474,11 +1474,12 @@ int l2tp_tunnel_register(struct l2tp_tunnel *tunnel, struct net *net,
 	}
 
 	sk = sock->sk;
-	write_lock(&sk->sk_callback_lock);
-
+	write_lock_bh(&sk->sk_callback_lock);
 	ret = l2tp_validate_socket(sk, net, tunnel->encap);
 	if (ret < 0)
 		goto err_sock;
+	rcu_assign_sk_user_data(sk, (void *) 1);
+	write_unlock_bh(&sk->sk_callback_lock);
 
 	tunnel->l2tp_net = net;
 	pn = l2tp_pernet(net);
@@ -1492,6 +1493,8 @@ int l2tp_tunnel_register(struct l2tp_tunnel *tunnel, struct net *net,
 			spin_unlock_bh(&pn->l2tp_tunnel_list_lock);
 			sock_put(sk);
 			ret = -EEXIST;
+			write_lock_bh(&sk->sk_callback_lock);
+			rcu_assign_sk_user_data(sk, NULL);
 			goto err_sock;
 		}
 	}
@@ -1522,16 +1525,15 @@ int l2tp_tunnel_register(struct l2tp_tunnel *tunnel, struct net *net,
 	if (tunnel->fd >= 0)
 		sockfd_put(sock);
 
-	write_unlock(&sk->sk_callback_lock);
 	return 0;
 
 err_sock:
+	write_unlock_bh(&sk->sk_callback_lock);
+
 	if (tunnel->fd < 0)
 		sock_release(sock);
 	else
 		sockfd_put(sock);
-
-	write_unlock(&sk->sk_callback_lock);
 err:
 	return ret;
 }
Jakub Sitnicki Nov. 21, 2022, 9 a.m. UTC | #3
On Sat, Nov 19, 2022 at 10:52 PM +09, Tetsuo Handa wrote:
> On 2022/11/19 22:03, Jakub Sitnicki wrote:
>> When holding a reader-writer spin lock we cannot sleep. Calling
>> setup_udp_tunnel_sock() with write lock held violates this rule, because we
>> end up calling percpu_down_read(), which might sleep, as syzbot reports
>> [1]:
>> 
>>  __might_resched.cold+0x222/0x26b kernel/sched/core.c:9890
>>  percpu_down_read include/linux/percpu-rwsem.h:49 [inline]
>>  cpus_read_lock+0x1b/0x140 kernel/cpu.c:310
>>  static_key_slow_inc+0x12/0x20 kernel/jump_label.c:158
>>  udp_tunnel_encap_enable include/net/udp_tunnel.h:187 [inline]
>>  setup_udp_tunnel_sock+0x43d/0x550 net/ipv4/udp_tunnel_core.c:81
>>  l2tp_tunnel_register+0xc51/0x1210 net/l2tp/l2tp_core.c:1509
>>  pppol2tp_connect+0xcdc/0x1a10 net/l2tp/l2tp_ppp.c:723
>> 
>> Trim the writer-side critical section for sk_callback_lock down to the
>> minimum, so that it covers only operations on sk_user_data.
>
> This patch does not look correct.
>
> Since l2tp_validate_socket() checks that sk->sk_user_data == NULL with
> sk->sk_callback_lock held, you need to call rcu_assign_sk_user_data(sk, tunnel)
> before releasing sk->sk_callback_lock.

You're right. v2 posted:

https://lore.kernel.org/netdev/20221121085426.21315-1-jakub@cloudflare.com/
Jakub Sitnicki Nov. 21, 2022, 9 a.m. UTC | #4
On Sat, Nov 19, 2022 at 11:27 PM +09, Tetsuo Handa wrote:
> On 2022/11/19 22:52, Tetsuo Handa wrote:
>> On 2022/11/19 22:03, Jakub Sitnicki wrote:
>>> When holding a reader-writer spin lock we cannot sleep. Calling
>>> setup_udp_tunnel_sock() with write lock held violates this rule, because we
>>> end up calling percpu_down_read(), which might sleep, as syzbot reports
>>> [1]:
>>>
>>>  __might_resched.cold+0x222/0x26b kernel/sched/core.c:9890
>>>  percpu_down_read include/linux/percpu-rwsem.h:49 [inline]
>>>  cpus_read_lock+0x1b/0x140 kernel/cpu.c:310
>>>  static_key_slow_inc+0x12/0x20 kernel/jump_label.c:158
>>>  udp_tunnel_encap_enable include/net/udp_tunnel.h:187 [inline]
>>>  setup_udp_tunnel_sock+0x43d/0x550 net/ipv4/udp_tunnel_core.c:81
>>>  l2tp_tunnel_register+0xc51/0x1210 net/l2tp/l2tp_core.c:1509
>>>  pppol2tp_connect+0xcdc/0x1a10 net/l2tp/l2tp_ppp.c:723
>>>
>>> Trim the writer-side critical section for sk_callback_lock down to the
>>> minimum, so that it covers only operations on sk_user_data.
>> 
>> This patch does not look correct.
>> 
>> Since l2tp_validate_socket() checks that sk->sk_user_data == NULL with
>> sk->sk_callback_lock held, you need to call rcu_assign_sk_user_data(sk, tunnel)
>> before releasing sk->sk_callback_lock.
>> 
>
> Is it safe to temporarily set a dummy pointer like below?
> If it is not safe, what makes assignments done by
> setup_udp_tunnel_sock() safe?

Yes, I think so. Great idea. I've used it in v2.

We can check & assign sk_user_data under sk_callback_lock, and then just
let setup_udp_tunnel_sock overwrite it with the same value, without
holding the lock.

I still think that it's best to keep the critical section as short as
possible, though.

[...]
Tetsuo Handa Nov. 21, 2022, 10:03 a.m. UTC | #5
On 2022/11/21 18:00, Jakub Sitnicki wrote:
>> Is it safe to temporarily set a dummy pointer like below?
>> If it is not safe, what makes assignments done by
>> setup_udp_tunnel_sock() safe?
> 
> Yes, I think so. Great idea. I've used it in v2.

So, you are sure that e.g.

	udp_sk(sk)->gro_receive = cfg->gro_receive;

in setup_udp_tunnel_sock() (where the caller is passing cfg->gro_receive == NULL)
never races with e.g. below check (because the socket might be sending/receiving
in progress since the socket is retrieved via user-specified file descriptor) ?

Then, v2 patch would be OK for fixing this regression. (But I think we still should
avoid retrieving a socket from user-specified file descriptor in order to avoid
lockdep race window.)


struct sk_buff *udp_gro_receive(struct list_head *head, struct sk_buff *skb,
                                struct udphdr *uh, struct sock *sk)
{
	(...snipped...)
        if (!sk || !udp_sk(sk)->gro_receive) {
		(...snipped...)
                /* no GRO, be sure flush the current packet */
                goto out;
        }
	(...snipped...)
        pp = call_gro_receive_sk(udp_sk(sk)->gro_receive, sk, head, skb);
out:
        skb_gro_flush_final(skb, pp, flush);
        return pp;
}

> 
> We can check & assign sk_user_data under sk_callback_lock, and then just
> let setup_udp_tunnel_sock overwrite it with the same value, without
> holding the lock.

Given that sk_user_data is RCU-protected on reader-side, don't we need to
wait for RCU grace period after resetting to NULL?

> 
> I still think that it's best to keep the critical section as short as
> possible, though.
Jakub Sitnicki Nov. 21, 2022, 9:55 p.m. UTC | #6
On Mon, Nov 21, 2022 at 07:03 PM +09, Tetsuo Handa wrote:
> On 2022/11/21 18:00, Jakub Sitnicki wrote:
>>> Is it safe to temporarily set a dummy pointer like below?
>>> If it is not safe, what makes assignments done by
>>> setup_udp_tunnel_sock() safe?
>> 
>> Yes, I think so. Great idea. I've used it in v2.
>
> So, you are sure that e.g.
>
> 	udp_sk(sk)->gro_receive = cfg->gro_receive;
>
> in setup_udp_tunnel_sock() (where the caller is passing cfg->gro_receive == NULL)
> never races with e.g. below check (because the socket might be sending/receiving
> in progress since the socket is retrieved via user-specified file descriptor) ?
>
> Then, v2 patch would be OK for fixing this regression. (But I think we still should
> avoid retrieving a socket from user-specified file descriptor in order to avoid
> lockdep race window.)
>
>
> struct sk_buff *udp_gro_receive(struct list_head *head, struct sk_buff *skb,
>                                 struct udphdr *uh, struct sock *sk)
> {
> 	(...snipped...)
>         if (!sk || !udp_sk(sk)->gro_receive) {
> 		(...snipped...)
>                 /* no GRO, be sure flush the current packet */
>                 goto out;
>         }
> 	(...snipped...)
>         pp = call_gro_receive_sk(udp_sk(sk)->gro_receive, sk, head, skb);
> out:
>         skb_gro_flush_final(skb, pp, flush);
>         return pp;
> }
>

First, let me say, that I get the impression that setup_udp_tunnel_sock
was not really meant to be used on pre-existing sockets created by
user-space. Even though l2tp and gtp seem to be doing that.

That is because, I don't see how it could be used properly. Given that
we need to check-and-set sk_user_data under sk_callback_lock, which
setup_udp_tunnel_sock doesn't grab itself. At the same time it might
sleep. There is no way to apply it without resorting to tricks, like we
did here.

So - yeah - there may be other problems. But if there are, they are not
related to the faulty commit b68777d54fac ("l2tp: Serialize access to
sk_user_data with sk_callback_lock"), which we're trying to fix. There
was no locking present in l2tp_tunnel_register before that point.

>> 
>> We can check & assign sk_user_data under sk_callback_lock, and then just
>> let setup_udp_tunnel_sock overwrite it with the same value, without
>> holding the lock.
>
> Given that sk_user_data is RCU-protected on reader-side, don't we need to
> wait for RCU grace period after resetting to NULL?

Who would be the reader?

If you have l2tp_udp_encap_recv in mind - the encap_rcv callback has not
been set yet, if we are on the error handling path in
l2tp_tunnel_register.

In general, though, yes - I think you are right. We must synchronize_rcu
before we kfree the tunnel.

However, that is also not related to the race to check-and-set
sk_user_data, which commit b68777d54fac is trying to fix.
Tetsuo Handa Nov. 22, 2022, 9:48 a.m. UTC | #7
On 2022/11/22 6:55, Jakub Sitnicki wrote:
> First, let me say, that I get the impression that setup_udp_tunnel_sock
> was not really meant to be used on pre-existing sockets created by
> user-space. Even though l2tp and gtp seem to be doing that.
> 
> That is because, I don't see how it could be used properly. Given that
> we need to check-and-set sk_user_data under sk_callback_lock, which
> setup_udp_tunnel_sock doesn't grab itself. At the same time it might
> sleep. There is no way to apply it without resorting to tricks, like we
> did here.
> 
> So - yeah - there may be other problems. But if there are, they are not
> related to the faulty commit b68777d54fac ("l2tp: Serialize access to
> sk_user_data with sk_callback_lock"), which we're trying to fix. There
> was no locking present in l2tp_tunnel_register before that point.

https://syzkaller.appspot.com/bug?extid=94cc2a66fc228b23f360 is the one
where changing lockdep class is concurrently done on pre-existing sockets.

I think we need to always create a new socket inside l2tp_tunnel_register(),
rather than trying to serialize setting of sk_user_data under sk_callback_lock.

> However, that is also not related to the race to check-and-set
> sk_user_data, which commit b68777d54fac is trying to fix.

Therefore, I feel that reverting commit b68777d54fac "l2tp: Serialize access
to sk_user_data with sk_callback_lock" might be the better choice.
Jakub Sitnicki Nov. 22, 2022, 10:46 a.m. UTC | #8
On Tue, Nov 22, 2022 at 06:48 PM +09, Tetsuo Handa wrote:
> On 2022/11/22 6:55, Jakub Sitnicki wrote:
>> First, let me say, that I get the impression that setup_udp_tunnel_sock
>> was not really meant to be used on pre-existing sockets created by
>> user-space. Even though l2tp and gtp seem to be doing that.
>> 
>> That is because, I don't see how it could be used properly. Given that
>> we need to check-and-set sk_user_data under sk_callback_lock, which
>> setup_udp_tunnel_sock doesn't grab itself. At the same time it might
>> sleep. There is no way to apply it without resorting to tricks, like we
>> did here.
>> 
>> So - yeah - there may be other problems. But if there are, they are not
>> related to the faulty commit b68777d54fac ("l2tp: Serialize access to
>> sk_user_data with sk_callback_lock"), which we're trying to fix. There
>> was no locking present in l2tp_tunnel_register before that point.
>
> https://syzkaller.appspot.com/bug?extid=94cc2a66fc228b23f360 is the one
> where changing lockdep class is concurrently done on pre-existing sockets.
>
> I think we need to always create a new socket inside l2tp_tunnel_register(),
> rather than trying to serialize setting of sk_user_data under sk_callback_lock.

While that would be easier to handle, I don't see how it can be done in
a backward-compatible way. User-space is allowed to pass a socket to
l2tp today [1].

>
>> However, that is also not related to the race to check-and-set
>> sk_user_data, which commit b68777d54fac is trying to fix.
>
> Therefore, I feel that reverting commit b68777d54fac "l2tp: Serialize access
> to sk_user_data with sk_callback_lock" might be the better choice.

I'm okay with that. Providing we can come up with have an alternative
fix to the race between l2tp and other sk_user_data users.

[1] https://elixir.bootlin.com/linux/v6.1-rc6/source/net/l2tp/l2tp_netlink.c#L220
Tetsuo Handa Nov. 22, 2022, 11:14 a.m. UTC | #9
On 2022/11/22 19:46, Jakub Sitnicki wrote:
>> https://syzkaller.appspot.com/bug?extid=94cc2a66fc228b23f360 is the one
>> where changing lockdep class is concurrently done on pre-existing sockets.
>>
>> I think we need to always create a new socket inside l2tp_tunnel_register(),
>> rather than trying to serialize setting of sk_user_data under sk_callback_lock.
> 
> While that would be easier to handle, I don't see how it can be done in
> a backward-compatible way. User-space is allowed to pass a socket to
> l2tp today [1].

What is the expected usage of the socket which was passed to l2tp_tunnel_register() ?
Is the userspace supposed to just close() that socket? Or, is the userspace allowed to
continue using the socket?

If the userspace might continue using the socket, we would

  create a new socket, copy required attributes (the source and destination addresses?) from
  the socket fetched via sockfd_lookup(), and call replace_fd() like e.g. umh_pipe_setup() does

inside l2tp_tunnel_register(). i-node number of the socket would change, but I assume that
the process which called l2tp_tunnel_register() is not using that i-node number.

Since the socket is a datagram socket, I think we can copy required attributes. But since
I'm not familiar with networking code, I don't know what attributes need to be copied. Thus,
I leave implementing it to netdev people.
Guillaume Nault Nov. 22, 2022, 2:10 p.m. UTC | #10
On Tue, Nov 22, 2022 at 08:14:33PM +0900, Tetsuo Handa wrote:
> On 2022/11/22 19:46, Jakub Sitnicki wrote:
> >> https://syzkaller.appspot.com/bug?extid=94cc2a66fc228b23f360 is the one
> >> where changing lockdep class is concurrently done on pre-existing sockets.
> >>
> >> I think we need to always create a new socket inside l2tp_tunnel_register(),
> >> rather than trying to serialize setting of sk_user_data under sk_callback_lock.
> > 
> > While that would be easier to handle, I don't see how it can be done in
> > a backward-compatible way. User-space is allowed to pass a socket to
> > l2tp today [1].
> 
> What is the expected usage of the socket which was passed to l2tp_tunnel_register() ?

It receives L2TP packets. Those can be either control or data ones.
Data packets are processed by the kernel. Control packets are queued to
user space.

> Is the userspace supposed to just close() that socket? Or, is the userspace allowed to
> continue using the socket?

User space uses this socket to send and receive L2TP control packets
(tunnel and session configuration, keep alive and tear down). Therefore
it absolutely needs to continue using this socket after the
registration phase.

> If the userspace might continue using the socket, we would
> 
>   create a new socket, copy required attributes (the source and destination addresses?) from
>   the socket fetched via sockfd_lookup(), and call replace_fd() like e.g. umh_pipe_setup() does
> 
> inside l2tp_tunnel_register(). i-node number of the socket would change, but I assume that
> the process which called l2tp_tunnel_register() is not using that i-node number.
> 
> Since the socket is a datagram socket, I think we can copy required attributes. But since
> I'm not familiar with networking code, I don't know what attributes need to be copied. Thus,
> I leave implementing it to netdev people.

That looks fragile to me. If the problem is that setup_udp_tunnel_sock()
can sleep, we can just drop the udp_tunnel_encap_enable() call from
setup_udp_tunnel_sock(), rename it __udp_tunnel_encap_enable() and make
make udp_tunnel_encap_enable() a wrapper around it that'd also call
udp_tunnel_encap_enable().
Tetsuo Handa Nov. 22, 2022, 2:28 p.m. UTC | #11
On 2022/11/22 23:10, Guillaume Nault wrote:
> User space uses this socket to send and receive L2TP control packets
> (tunnel and session configuration, keep alive and tear down). Therefore
> it absolutely needs to continue using this socket after the
> registration phase.

Thank you for explanation.

>> If the userspace might continue using the socket, we would
>>
>>   create a new socket, copy required attributes (the source and destination addresses?) from
>>   the socket fetched via sockfd_lookup(), and call replace_fd() like e.g. umh_pipe_setup() does
>>
>> inside l2tp_tunnel_register(). i-node number of the socket would change, but I assume that
>> the process which called l2tp_tunnel_register() is not using that i-node number.
>>
>> Since the socket is a datagram socket, I think we can copy required attributes. But since
>> I'm not familiar with networking code, I don't know what attributes need to be copied. Thus,
>> I leave implementing it to netdev people.
> 
> That looks fragile to me. If the problem is that setup_udp_tunnel_sock()
> can sleep, we can just drop the udp_tunnel_encap_enable() call from
> setup_udp_tunnel_sock(), rename it __udp_tunnel_encap_enable() and make
> make udp_tunnel_encap_enable() a wrapper around it that'd also call
> udp_tunnel_encap_enable().
> 

That's what I thought at https://lkml.kernel.org/r/c64284f4-2c2a-ecb9-a08e-9e49d49c720b@I-love.SAKURA.ne.jp .

But the problem is not that setup_udp_tunnel_sock() can sleep. The problem is that lockdep
gets confused due to changing lockdep class after the socket is already published. We need
to avoid calling lockdep_set_class_and_name() on a socket retrieved via sockfd_lookup().
Guillaume Nault Nov. 23, 2022, 3:24 p.m. UTC | #12
On Tue, Nov 22, 2022 at 11:28:45PM +0900, Tetsuo Handa wrote:
> On 2022/11/22 23:10, Guillaume Nault wrote:
> > User space uses this socket to send and receive L2TP control packets
> > (tunnel and session configuration, keep alive and tear down). Therefore
> > it absolutely needs to continue using this socket after the
> > registration phase.
> 
> Thank you for explanation.
> 
> >> If the userspace might continue using the socket, we would
> >>
> >>   create a new socket, copy required attributes (the source and destination addresses?) from
> >>   the socket fetched via sockfd_lookup(), and call replace_fd() like e.g. umh_pipe_setup() does
> >>
> >> inside l2tp_tunnel_register(). i-node number of the socket would change, but I assume that
> >> the process which called l2tp_tunnel_register() is not using that i-node number.
> >>
> >> Since the socket is a datagram socket, I think we can copy required attributes. But since
> >> I'm not familiar with networking code, I don't know what attributes need to be copied. Thus,
> >> I leave implementing it to netdev people.
> > 
> > That looks fragile to me. If the problem is that setup_udp_tunnel_sock()
> > can sleep, we can just drop the udp_tunnel_encap_enable() call from
> > setup_udp_tunnel_sock(), rename it __udp_tunnel_encap_enable() and make
> > make udp_tunnel_encap_enable() a wrapper around it that'd also call
> > udp_tunnel_encap_enable().
> > 
> 
> That's what I thought at https://lkml.kernel.org/r/c64284f4-2c2a-ecb9-a08e-9e49d49c720b@I-love.SAKURA.ne.jp .
> 
> But the problem is not that setup_udp_tunnel_sock() can sleep. The problem is that lockdep
> gets confused due to changing lockdep class after the socket is already published. We need
> to avoid calling lockdep_set_class_and_name() on a socket retrieved via sockfd_lookup().

This is a second problem. The problem of setting sk_user_data under
sk_callback_lock write protection (while still calling
udp_tunnel_encap_enable() from sleepable context) still remains.

For lockdep_set_class_and_name(), maybe we could store the necessary
socket information (addresses, ports and checksum configuration) in the
l2tp_tunnel structure, thus avoiding the need to read them from the
socket. This way, we could stop locking the user space socket in
l2tp_xmit_core() and drop the lockdep_set_class_and_name() call.
I think either you or Jakub proposed something like this in another
thread.
Tom Parkin Nov. 24, 2022, 10:07 a.m. UTC | #13
On  Wed, Nov 23, 2022 at 16:24:00 +0100, Guillaume Nault wrote:
> On Tue, Nov 22, 2022 at 11:28:45PM +0900, Tetsuo Handa wrote:
> > That's what I thought at https://lkml.kernel.org/r/c64284f4-2c2a-ecb9-a08e-9e49d49c720b@I-love.SAKURA.ne.jp .
> > 
> > But the problem is not that setup_udp_tunnel_sock() can sleep. The problem is that lockdep
> > gets confused due to changing lockdep class after the socket is already published. We need
> > to avoid calling lockdep_set_class_and_name() on a socket retrieved via sockfd_lookup().
> 
> This is a second problem. The problem of setting sk_user_data under
> sk_callback_lock write protection (while still calling
> udp_tunnel_encap_enable() from sleepable context) still remains.
> 
> For lockdep_set_class_and_name(), maybe we could store the necessary
> socket information (addresses, ports and checksum configuration) in the
> l2tp_tunnel structure, thus avoiding the need to read them from the
> socket. This way, we could stop locking the user space socket in
> l2tp_xmit_core() and drop the lockdep_set_class_and_name() call.
> I think either you or Jakub proposed something like this in another
> thread.

I note that l2tp_xmit_core calls ip_queue_xmit which expects a socket
atomic context*.

It also accesses struct inet_sock corking data which may also need locks
to safely access.

Possibly we could somehow work around that, but on the face of it we'd
need to do a bit more work to avoid the socket lock in the tx path.

* davem fixed locking in the l2tp xmit path in:

6af88da14ee2 ("l2tp: Fix locking in l2tp_core.c")
Guillaume Nault Nov. 24, 2022, 10:27 a.m. UTC | #14
On Thu, Nov 24, 2022 at 10:07:51AM +0000, Tom Parkin wrote:
> On  Wed, Nov 23, 2022 at 16:24:00 +0100, Guillaume Nault wrote:
> > On Tue, Nov 22, 2022 at 11:28:45PM +0900, Tetsuo Handa wrote:
> > > That's what I thought at https://lkml.kernel.org/r/c64284f4-2c2a-ecb9-a08e-9e49d49c720b@I-love.SAKURA.ne.jp .
> > > 
> > > But the problem is not that setup_udp_tunnel_sock() can sleep. The problem is that lockdep
> > > gets confused due to changing lockdep class after the socket is already published. We need
> > > to avoid calling lockdep_set_class_and_name() on a socket retrieved via sockfd_lookup().
> > 
> > This is a second problem. The problem of setting sk_user_data under
> > sk_callback_lock write protection (while still calling
> > udp_tunnel_encap_enable() from sleepable context) still remains.
> > 
> > For lockdep_set_class_and_name(), maybe we could store the necessary
> > socket information (addresses, ports and checksum configuration) in the
> > l2tp_tunnel structure, thus avoiding the need to read them from the
> > socket. This way, we could stop locking the user space socket in
> > l2tp_xmit_core() and drop the lockdep_set_class_and_name() call.
> > I think either you or Jakub proposed something like this in another
> > thread.
> 
> I note that l2tp_xmit_core calls ip_queue_xmit which expects a socket
> atomic context*.
> 
> It also accesses struct inet_sock corking data which may also need locks
> to safely access.
> 
> Possibly we could somehow work around that, but on the face of it we'd
> need to do a bit more work to avoid the socket lock in the tx path.

I was thinking of avoiding using the socket entirely, which indeed
means replacing ip_queue_xmit(). We should probably use the different
variants of udp_tunnel_xmit_skb() instead.

> * davem fixed locking in the l2tp xmit path in:
> 
> 6af88da14ee2 ("l2tp: Fix locking in l2tp_core.c")
> -- 
> Tom Parkin
> Katalix Systems Ltd
> https://katalix.com
> Catalysts for your Embedded Linux software development
diff mbox series

Patch

diff --git a/net/l2tp/l2tp_core.c b/net/l2tp/l2tp_core.c
index 754fdda8a5f5..100d17908196 100644
--- a/net/l2tp/l2tp_core.c
+++ b/net/l2tp/l2tp_core.c
@@ -1474,11 +1474,20 @@  int l2tp_tunnel_register(struct l2tp_tunnel *tunnel, struct net *net,
 	}
 
 	sk = sock->sk;
-	write_lock(&sk->sk_callback_lock);
-
+	write_lock_bh(&sk->sk_callback_lock);
 	ret = l2tp_validate_socket(sk, net, tunnel->encap);
 	if (ret < 0)
-		goto err_sock;
+		goto err_inval_sock;
+
+	switch (tunnel->encap) {
+	case L2TP_ENCAPTYPE_IP:
+		rcu_assign_sk_user_data(sk, tunnel);
+		break;
+	case L2TP_ENCAPTYPE_UDP:
+		/* nothing to do */
+		break;
+	}
+	write_unlock_bh(&sk->sk_callback_lock);
 
 	tunnel->l2tp_net = net;
 	pn = l2tp_pernet(net);
@@ -1507,8 +1516,6 @@  int l2tp_tunnel_register(struct l2tp_tunnel *tunnel, struct net *net,
 		};
 
 		setup_udp_tunnel_sock(net, sock, &udp_cfg);
-	} else {
-		rcu_assign_sk_user_data(sk, tunnel);
 	}
 
 	tunnel->old_sk_destruct = sk->sk_destruct;
@@ -1522,16 +1529,18 @@  int l2tp_tunnel_register(struct l2tp_tunnel *tunnel, struct net *net,
 	if (tunnel->fd >= 0)
 		sockfd_put(sock);
 
-	write_unlock(&sk->sk_callback_lock);
 	return 0;
 
 err_sock:
+	write_lock_bh(&sk->sk_callback_lock);
+	rcu_assign_sk_user_data(sk, NULL);
+err_inval_sock:
+	write_unlock_bh(&sk->sk_callback_lock);
+
 	if (tunnel->fd < 0)
 		sock_release(sock);
 	else
 		sockfd_put(sock);
-
-	write_unlock(&sk->sk_callback_lock);
 err:
 	return ret;
 }