Message ID | 20241113105611.GA6723@incl (mailing list archive) |
---|---|
State | Changes Requested |
Delegated to: | Netdev Maintainers |
Headers | show |
Series | [net] net/ipv6: release expired exception dst cached in socket | expand |
On 11/13/24 11:56, Jiri Wiesner wrote: > Dst objects get leaked in ip6_negative_advice() when this function is > executed for an expired IPv6 route located in the exception table. There > are several conditions that must be fulfilled for the leak to occur: > * an ICMPv6 packet indicating a change of the MTU for the path is received > * a TCP connection that uses the exception dst for routing packets must > start timing out so that TCP begins retransmissions > * after the exception dst expires, the FIB6 garbage collector must not run > before TCP executes ip6_negative_advice() for the expired exception dst > > The following steps reproduce the issue: > > ip link add veth1 mtu 65535 type veth peer veth0 mtu 65535 > ip netns add ns0 > ip link set veth1 netns ns0 > ip addr add fd00::1/24 dev veth0 > ip -n ns0 addr add fd00::2/24 dev veth1 > ip link set up dev veth0 > ip -n ns0 link set up dev lo > ip -n ns0 link set up dev veth1 > ip -n ns0 route add default via fd00::1 dev veth1 > > ip link add veth3 mtu 65535 type veth peer veth2 mtu 65535 > ip netns add ns2 > ip link set veth3 netns ns2 > ip addr add fd02::1/24 dev veth2 > ip -n ns2 addr add fd02::2/24 dev veth3 > ip link set up dev veth2 > ip -n ns2 link set up dev lo > ip -n ns2 link set up dev veth3 > ip -n ns2 route add default via fd02::1 dev veth3 > > ip netns exec ns0 bash -c "echo 6 > /proc/sys/net/ipv6/route/mtu_expires" > ip netns exec ns0 bash -c "echo 900 > /proc/sys/net/ipv6/route/gc_interval" > sleep 30 > > ip6tables -A FORWARD -i veth0 -d fd02::/24 -j ACCEPT > ip6tables -A FORWARD -i veth2 -d fd00::/24 -j ACCEPT > echo 1 > /proc/sys/net/ipv6/conf/all/forwarding > > (ip netns exec ns2 netcat -6 -l -s fd02::2 -p 1234 &> /dev/null) & serv=$! > sleep 1 > dd if=/dev/zero bs=1M | ip netns exec ns0 netcat -6 fd02::2 1234 & clnt=$! > sleep 1 > ip link set veth2 mtu 2000 > sleep 1 > ip6tables -D FORWARD -i veth2 -d fd00::/24 -j ACCEPT > ip6tables -A FORWARD -i veth2 -d fd00::/24 -j DROP > > sleep 10 > kill $clnt $serv > wait $serv > > ip6tables -D FORWARD -i veth0 -d fd02::/24 -j ACCEPT > ip6tables -D FORWARD -i veth2 -d fd00::/24 -j DROP > > ip -n ns0 link set down dev lo > ip -n ns0 link set down dev veth1 > ip -n ns0 link delete dev veth1 > ip netns delete ns0 > > ip -n ns2 link set down dev lo > ip -n ns2 link set down dev veth3 > ip -n ns2 link delete dev veth3 > ip netns delete ns2 > > This trace has been created with kprobes under kernel 6.12-rc7. Upon > receiving an ICMPv6 packet indicating an MTU change, exception dst > 0xff3027eec766c100 is created and inserted into the IPv6 exception table: > 3651.126884: rt6_insert_exception: (rt6_insert_exception+0x0/0x2b0) dst=0xff3027eec766c100 rcuref=0 > 3651.126889: <stack trace> > => rt6_insert_exception+0x5/0x2b0 > => __ip6_rt_update_pmtu+0x1ef/0x380 > => inet6_csk_update_pmtu+0x4b/0x90 > => tcp_v6_mtu_reduced.part.22+0x34/0xc0 > The exception dst is used to route packets: > 3651.126902: inet6_csk_route_socket: (inet6_csk_update_pmtu+0x58/0x90 <- inet6_csk_route_socket) dst=0xff3027eec766c100 > At this point, the connection has been severed and TCP starts retransmissions: > 3652.349466: tcp_retransmit_timer: (tcp_retransmit_timer+0x0/0xba0) net=0xff3027ef4a0f3780 sk=0xff3027eeeb1f3d80 > 3652.349497: inet6_csk_route_socket: (inet6_csk_xmit+0x56/0x130 <- inet6_csk_route_socket) dst=0xff3027eec766c100 > 3652.769469: tcp_retransmit_timer: (tcp_retransmit_timer+0x0/0xba0) net=0xff3027ef4a0f3780 sk=0xff3027eeeb1f3d80 > 3652.769495: inet6_csk_route_socket: (inet6_csk_xmit+0x56/0x130 <- inet6_csk_route_socket) dst=0xff3027eec766c100 > 3653.596135: tcp_retransmit_timer: (tcp_retransmit_timer+0x0/0xba0) net=0xff3027ef4a0f3780 sk=0xff3027eeeb1f3d80 > 3653.596156: inet6_csk_route_socket: (inet6_csk_xmit+0x56/0x130 <- inet6_csk_route_socket) dst=0xff3027eec766c100 > 3655.249465: tcp_retransmit_timer: (tcp_retransmit_timer+0x0/0xba0) net=0xff3027ef4a0f3780 sk=0xff3027eeeb1f3d80 > 3655.249490: inet6_csk_route_socket: (inet6_csk_xmit+0x56/0x130 <- inet6_csk_route_socket) dst=0xff3027eec766c100 > 3658.689463: tcp_retransmit_timer: (tcp_retransmit_timer+0x0/0xba0) net=0xff3027ef4a0f3780 sk=0xff3027eeeb1f3d80 > When ip6_negative_advice() is executed the refcount is 2 - the increment > made by dst_init() and the increment made by the socket: > 3658.689475: ip6_negative_advice: (ip6_negative_advice+0x0/0xa0) net=0xff3027ef4a0f3780 sk=0xff3027eeeb1f3d80 dst=0xff3027eec766c100 rcuref=1 > This is the result of dst_hold() and sk_dst_reset() in ip6_negative_advice(): > 3658.689477: dst_release: (dst_release+0x0/0x80) dst=0xff3027eec766c100 rcuref=2 > 3658.689498: rt6_remove_exception_rt: (rt6_remove_exception_rt+0x0/0xa0) dst=0xff3027eec766c100 rcuref=1 > 3658.689501: rt6_remove_exception: (rt6_remove_exception.part.58+0x0/0xe0) dst=0xff3027eec766c100 rcuref=1 > The refcount of dst 0xff3027eec766c100 is decremented by 1 as a result of > removing the exception from the exception table with > rt6_remove_exception_rt(): > 3658.689505: dst_release: (dst_release+0x0/0x80) dst=0xff3027eec766c100 rcuref=1 > The retransmissions continue without the exception dst being used for > routing packets: > 3662.352796: tcp_retransmit_timer: (tcp_retransmit_timer+0x0/0xba0) net=0xff3027ef00cab780 sk=0xff3027eeeb1f0000 > 3662.769470: tcp_retransmit_timer: (tcp_retransmit_timer+0x0/0xba0) net=0xff3027ef00cab780 sk=0xff3027eeeb1f0000 > 3663.596132: tcp_retransmit_timer: (tcp_retransmit_timer+0x0/0xba0) net=0xff3027ef00cab780 sk=0xff3027eeeb1f0000 > > The ip6_dst_destroy() function was instrumented but there was no entry for > dst 0xff3027eec766c100 in the trace even long after the ns0 and ns2 net > namespaces had been destroyed. The refcount made by the socket was never > released. The refcount of the dst is decremented in sk_dst_reset() but > that decrement is counteracted by a dst_hold() intentionally placed just > before the sk_dst_reset() in ip6_negative_advice(). The probem is that > sockets that keep a reference to a dst in the sk_dst_cache member > increment the refcount of the dst by 1. This is apparent in the following > code paths: > * When ip6_route_output_flags() finds a dst that is then stored in > sk->sk_dst_cache by ip6_dst_store() called from inet6_csk_route_socket() > * When inet_sock_destruct() calls dst_release() for sk->sk_dst_cache > Provided the dst is not kept in the sk_dst_cache member of another socket, > there is no other object tied to the dst (the socket lost its reference > and the dst is no longer in the exception table) and the dst becomes a > leaked object after ip6_negative_advice() has finished. > > As a result of this dst leak, an unbalanced refcount is reported for the > loopback device of a net namespace being destroyed under kernels that do > not contain e5f80fcf869a ("ipv6: give an IPv6 dev to blackhole_netdev"): > unregister_netdevice: waiting for lo to become free. Usage count = 2 > > Fix the dst leak by removing the dst_hold() in ip6_negative_advice(). The > patch that introduced the dst_hold() in ip6_negative_advice() was > 92f1655aa2b22 ("net: fix __dst_negative_advice() race"). But 92f1655aa2b22 > merely refactored the code with regards to the dst refcount so the issue > was present even before 92f1655aa2b22. The bug was introduced in > 54c1a859efd9f ("ipv6: Don't drop cache route entry unless timer actually > expired.") where the expired cached route is deleted, the sk_dst_cache > member of the socket is set to NULL by calling dst_negative_advice() but > the refcount belonging to the socket is left unbalanced. > > The IPv4 version - ipv4_negative_advice() - is not affected by this bug. A > nexthop exception is created and the dst associated with the socket fails > a check after the ICMPv6 packet indicating an MTU change has been > received. When the TCP connection times out ipv4_negative_advice() merely > resets the sk_dst_cache of the socket while decrementing the refcount of > the exception dst. Then, the expired nexthop exception is deleted along > with its routes in find_exception() while TCP tries to retransmit the same > packet again. > > Fixes: 92f1655aa2b22 ("net: fix __dst_negative_advice() race") > Fixes: 54c1a859efd9f ("ipv6: Don't drop cache route entry unless timer actually expired.") > > Signed-off-by: Jiri Wiesner <jwiesner@suse.de> The patch makes sense to me, but the changelog is very hard to follow, please: - move the testing code to a new, specific self-tests (possibly tuning the timeouts to a [much] shorter runtime) - clearly separate the probe logs from the describing commenting text - try to be slightly a bit less verbose Additionally please solve a formal problem above: there should be no white lines in the tag area between the Fixes and SoB tags. Thanks, Paolo
On Thu, Nov 14, 2024 at 12:04:36PM +0100, Paolo Abeni wrote: > On 11/13/24 11:56, Jiri Wiesner wrote: > > Dst objects get leaked in ip6_negative_advice() when this function is > > executed for an expired IPv6 route located in the exception table. There > > are several conditions that must be fulfilled for the leak to occur: > > * an ICMPv6 packet indicating a change of the MTU for the path is received > > * a TCP connection that uses the exception dst for routing packets must > > start timing out so that TCP begins retransmissions > > * after the exception dst expires, the FIB6 garbage collector must not run > > before TCP executes ip6_negative_advice() for the expired exception dst > > > > The following steps reproduce the issue: > > > > ip link add veth1 mtu 65535 type veth peer veth0 mtu 65535 > > ip netns add ns0 > > ip link set veth1 netns ns0 > > ip addr add fd00::1/24 dev veth0 > > ip -n ns0 addr add fd00::2/24 dev veth1 > > ip link set up dev veth0 > > ip -n ns0 link set up dev lo > > ip -n ns0 link set up dev veth1 > > ip -n ns0 route add default via fd00::1 dev veth1 > > > > ip link add veth3 mtu 65535 type veth peer veth2 mtu 65535 > > ip netns add ns2 > > ip link set veth3 netns ns2 > > ip addr add fd02::1/24 dev veth2 > > ip -n ns2 addr add fd02::2/24 dev veth3 > > ip link set up dev veth2 > > ip -n ns2 link set up dev lo > > ip -n ns2 link set up dev veth3 > > ip -n ns2 route add default via fd02::1 dev veth3 > > > > ip netns exec ns0 bash -c "echo 6 > /proc/sys/net/ipv6/route/mtu_expires" > > ip netns exec ns0 bash -c "echo 900 > /proc/sys/net/ipv6/route/gc_interval" > > sleep 30 > > > > ip6tables -A FORWARD -i veth0 -d fd02::/24 -j ACCEPT > > ip6tables -A FORWARD -i veth2 -d fd00::/24 -j ACCEPT > > echo 1 > /proc/sys/net/ipv6/conf/all/forwarding > > > > (ip netns exec ns2 netcat -6 -l -s fd02::2 -p 1234 &> /dev/null) & serv=$! > > sleep 1 > > dd if=/dev/zero bs=1M | ip netns exec ns0 netcat -6 fd02::2 1234 & clnt=$! > > sleep 1 > > ip link set veth2 mtu 2000 > > sleep 1 > > ip6tables -D FORWARD -i veth2 -d fd00::/24 -j ACCEPT > > ip6tables -A FORWARD -i veth2 -d fd00::/24 -j DROP > > > > sleep 10 > > kill $clnt $serv > > wait $serv > > > > ip6tables -D FORWARD -i veth0 -d fd02::/24 -j ACCEPT > > ip6tables -D FORWARD -i veth2 -d fd00::/24 -j DROP > > > > ip -n ns0 link set down dev lo > > ip -n ns0 link set down dev veth1 > > ip -n ns0 link delete dev veth1 > > ip netns delete ns0 > > > > ip -n ns2 link set down dev lo > > ip -n ns2 link set down dev veth3 > > ip -n ns2 link delete dev veth3 > > ip netns delete ns2 > > > > This trace has been created with kprobes under kernel 6.12-rc7. Upon > > receiving an ICMPv6 packet indicating an MTU change, exception dst > > 0xff3027eec766c100 is created and inserted into the IPv6 exception table: > > 3651.126884: rt6_insert_exception: (rt6_insert_exception+0x0/0x2b0) dst=0xff3027eec766c100 rcuref=0 > > 3651.126889: <stack trace> > > => rt6_insert_exception+0x5/0x2b0 > > => __ip6_rt_update_pmtu+0x1ef/0x380 > > => inet6_csk_update_pmtu+0x4b/0x90 > > => tcp_v6_mtu_reduced.part.22+0x34/0xc0 > > The exception dst is used to route packets: > > 3651.126902: inet6_csk_route_socket: (inet6_csk_update_pmtu+0x58/0x90 <- inet6_csk_route_socket) dst=0xff3027eec766c100 > > At this point, the connection has been severed and TCP starts retransmissions: > > 3652.349466: tcp_retransmit_timer: (tcp_retransmit_timer+0x0/0xba0) net=0xff3027ef4a0f3780 sk=0xff3027eeeb1f3d80 > > 3652.349497: inet6_csk_route_socket: (inet6_csk_xmit+0x56/0x130 <- inet6_csk_route_socket) dst=0xff3027eec766c100 > > 3652.769469: tcp_retransmit_timer: (tcp_retransmit_timer+0x0/0xba0) net=0xff3027ef4a0f3780 sk=0xff3027eeeb1f3d80 > > 3652.769495: inet6_csk_route_socket: (inet6_csk_xmit+0x56/0x130 <- inet6_csk_route_socket) dst=0xff3027eec766c100 > > 3653.596135: tcp_retransmit_timer: (tcp_retransmit_timer+0x0/0xba0) net=0xff3027ef4a0f3780 sk=0xff3027eeeb1f3d80 > > 3653.596156: inet6_csk_route_socket: (inet6_csk_xmit+0x56/0x130 <- inet6_csk_route_socket) dst=0xff3027eec766c100 > > 3655.249465: tcp_retransmit_timer: (tcp_retransmit_timer+0x0/0xba0) net=0xff3027ef4a0f3780 sk=0xff3027eeeb1f3d80 > > 3655.249490: inet6_csk_route_socket: (inet6_csk_xmit+0x56/0x130 <- inet6_csk_route_socket) dst=0xff3027eec766c100 > > 3658.689463: tcp_retransmit_timer: (tcp_retransmit_timer+0x0/0xba0) net=0xff3027ef4a0f3780 sk=0xff3027eeeb1f3d80 > > When ip6_negative_advice() is executed the refcount is 2 - the increment > > made by dst_init() and the increment made by the socket: > > 3658.689475: ip6_negative_advice: (ip6_negative_advice+0x0/0xa0) net=0xff3027ef4a0f3780 sk=0xff3027eeeb1f3d80 dst=0xff3027eec766c100 rcuref=1 > > This is the result of dst_hold() and sk_dst_reset() in ip6_negative_advice(): > > 3658.689477: dst_release: (dst_release+0x0/0x80) dst=0xff3027eec766c100 rcuref=2 > > 3658.689498: rt6_remove_exception_rt: (rt6_remove_exception_rt+0x0/0xa0) dst=0xff3027eec766c100 rcuref=1 > > 3658.689501: rt6_remove_exception: (rt6_remove_exception.part.58+0x0/0xe0) dst=0xff3027eec766c100 rcuref=1 > > The refcount of dst 0xff3027eec766c100 is decremented by 1 as a result of > > removing the exception from the exception table with > > rt6_remove_exception_rt(): > > 3658.689505: dst_release: (dst_release+0x0/0x80) dst=0xff3027eec766c100 rcuref=1 > > The retransmissions continue without the exception dst being used for > > routing packets: > > 3662.352796: tcp_retransmit_timer: (tcp_retransmit_timer+0x0/0xba0) net=0xff3027ef00cab780 sk=0xff3027eeeb1f0000 > > 3662.769470: tcp_retransmit_timer: (tcp_retransmit_timer+0x0/0xba0) net=0xff3027ef00cab780 sk=0xff3027eeeb1f0000 > > 3663.596132: tcp_retransmit_timer: (tcp_retransmit_timer+0x0/0xba0) net=0xff3027ef00cab780 sk=0xff3027eeeb1f0000 > > > > The ip6_dst_destroy() function was instrumented but there was no entry for > > dst 0xff3027eec766c100 in the trace even long after the ns0 and ns2 net > > namespaces had been destroyed. The refcount made by the socket was never > > released. The refcount of the dst is decremented in sk_dst_reset() but > > that decrement is counteracted by a dst_hold() intentionally placed just > > before the sk_dst_reset() in ip6_negative_advice(). The probem is that > > sockets that keep a reference to a dst in the sk_dst_cache member > > increment the refcount of the dst by 1. This is apparent in the following > > code paths: > > * When ip6_route_output_flags() finds a dst that is then stored in > > sk->sk_dst_cache by ip6_dst_store() called from inet6_csk_route_socket() > > * When inet_sock_destruct() calls dst_release() for sk->sk_dst_cache > > Provided the dst is not kept in the sk_dst_cache member of another socket, > > there is no other object tied to the dst (the socket lost its reference > > and the dst is no longer in the exception table) and the dst becomes a > > leaked object after ip6_negative_advice() has finished. > > > > As a result of this dst leak, an unbalanced refcount is reported for the > > loopback device of a net namespace being destroyed under kernels that do > > not contain e5f80fcf869a ("ipv6: give an IPv6 dev to blackhole_netdev"): > > unregister_netdevice: waiting for lo to become free. Usage count = 2 > > > > Fix the dst leak by removing the dst_hold() in ip6_negative_advice(). The > > patch that introduced the dst_hold() in ip6_negative_advice() was > > 92f1655aa2b22 ("net: fix __dst_negative_advice() race"). But 92f1655aa2b22 > > merely refactored the code with regards to the dst refcount so the issue > > was present even before 92f1655aa2b22. The bug was introduced in > > 54c1a859efd9f ("ipv6: Don't drop cache route entry unless timer actually > > expired.") where the expired cached route is deleted, the sk_dst_cache > > member of the socket is set to NULL by calling dst_negative_advice() but > > the refcount belonging to the socket is left unbalanced. > > > > The IPv4 version - ipv4_negative_advice() - is not affected by this bug. A > > nexthop exception is created and the dst associated with the socket fails > > a check after the ICMPv6 packet indicating an MTU change has been > > received. When the TCP connection times out ipv4_negative_advice() merely > > resets the sk_dst_cache of the socket while decrementing the refcount of > > the exception dst. Then, the expired nexthop exception is deleted along > > with its routes in find_exception() while TCP tries to retransmit the same > > packet again. > > > > Fixes: 92f1655aa2b22 ("net: fix __dst_negative_advice() race") > > Fixes: 54c1a859efd9f ("ipv6: Don't drop cache route entry unless timer actually expired.") > > > > Signed-off-by: Jiri Wiesner <jwiesner@suse.de> > > The patch makes sense to me, but the changelog is very hard to follow, I see. Thanks for the feedback. > please: > - move the testing code to a new, specific self-tests (possibly tuning > the timeouts to a [much] shorter runtime) I have been looking into turning the steps to reproduce the issue in a selftest script. I found out it is serious business and not something done in an hour, which made me question the need for creating a test for this issue. AFAIK, it is not common practice to supply a test script for every fix merged into the kernel. I could reduce the runtime to roughly 15 seconds but I think those 15 seconds would be wasted for testing an issue that may never come back unless someone decides to redesign sk_dst_reset() or rt6_remove_exception() in a rather disruptive way that would result in an unbalanced dst refcount again. I would much rather just drop the steps to reproduce the issue from the changelog. The commands themselves do not make the issue obvious unless someone runs them and uses a tracing approach to log what happens to dsts. > - clearly separate the probe logs from the describing commenting text > - try to be slightly a bit less verbose > > Additionally please solve a formal problem above: there should be no > white lines in the tag area between the Fixes and SoB tags. I will send another patch implementing these suggestions. J.
diff --git a/net/ipv6/route.c b/net/ipv6/route.c index b4251915585f..b70267c8d251 100644 --- a/net/ipv6/route.c +++ b/net/ipv6/route.c @@ -2780,10 +2780,7 @@ static void ip6_negative_advice(struct sock *sk, if (rt->rt6i_flags & RTF_CACHE) { rcu_read_lock(); if (rt6_check_expired(rt)) { - /* counteract the dst_release() in sk_dst_reset() */ - dst_hold(dst); sk_dst_reset(sk); - rt6_remove_exception_rt(rt); } rcu_read_unlock();
Dst objects get leaked in ip6_negative_advice() when this function is executed for an expired IPv6 route located in the exception table. There are several conditions that must be fulfilled for the leak to occur: * an ICMPv6 packet indicating a change of the MTU for the path is received * a TCP connection that uses the exception dst for routing packets must start timing out so that TCP begins retransmissions * after the exception dst expires, the FIB6 garbage collector must not run before TCP executes ip6_negative_advice() for the expired exception dst The following steps reproduce the issue: ip link add veth1 mtu 65535 type veth peer veth0 mtu 65535 ip netns add ns0 ip link set veth1 netns ns0 ip addr add fd00::1/24 dev veth0 ip -n ns0 addr add fd00::2/24 dev veth1 ip link set up dev veth0 ip -n ns0 link set up dev lo ip -n ns0 link set up dev veth1 ip -n ns0 route add default via fd00::1 dev veth1 ip link add veth3 mtu 65535 type veth peer veth2 mtu 65535 ip netns add ns2 ip link set veth3 netns ns2 ip addr add fd02::1/24 dev veth2 ip -n ns2 addr add fd02::2/24 dev veth3 ip link set up dev veth2 ip -n ns2 link set up dev lo ip -n ns2 link set up dev veth3 ip -n ns2 route add default via fd02::1 dev veth3 ip netns exec ns0 bash -c "echo 6 > /proc/sys/net/ipv6/route/mtu_expires" ip netns exec ns0 bash -c "echo 900 > /proc/sys/net/ipv6/route/gc_interval" sleep 30 ip6tables -A FORWARD -i veth0 -d fd02::/24 -j ACCEPT ip6tables -A FORWARD -i veth2 -d fd00::/24 -j ACCEPT echo 1 > /proc/sys/net/ipv6/conf/all/forwarding (ip netns exec ns2 netcat -6 -l -s fd02::2 -p 1234 &> /dev/null) & serv=$! sleep 1 dd if=/dev/zero bs=1M | ip netns exec ns0 netcat -6 fd02::2 1234 & clnt=$! sleep 1 ip link set veth2 mtu 2000 sleep 1 ip6tables -D FORWARD -i veth2 -d fd00::/24 -j ACCEPT ip6tables -A FORWARD -i veth2 -d fd00::/24 -j DROP sleep 10 kill $clnt $serv wait $serv ip6tables -D FORWARD -i veth0 -d fd02::/24 -j ACCEPT ip6tables -D FORWARD -i veth2 -d fd00::/24 -j DROP ip -n ns0 link set down dev lo ip -n ns0 link set down dev veth1 ip -n ns0 link delete dev veth1 ip netns delete ns0 ip -n ns2 link set down dev lo ip -n ns2 link set down dev veth3 ip -n ns2 link delete dev veth3 ip netns delete ns2 This trace has been created with kprobes under kernel 6.12-rc7. Upon receiving an ICMPv6 packet indicating an MTU change, exception dst 0xff3027eec766c100 is created and inserted into the IPv6 exception table: 3651.126884: rt6_insert_exception: (rt6_insert_exception+0x0/0x2b0) dst=0xff3027eec766c100 rcuref=0 3651.126889: <stack trace> => rt6_insert_exception+0x5/0x2b0 => __ip6_rt_update_pmtu+0x1ef/0x380 => inet6_csk_update_pmtu+0x4b/0x90 => tcp_v6_mtu_reduced.part.22+0x34/0xc0 The exception dst is used to route packets: 3651.126902: inet6_csk_route_socket: (inet6_csk_update_pmtu+0x58/0x90 <- inet6_csk_route_socket) dst=0xff3027eec766c100 At this point, the connection has been severed and TCP starts retransmissions: 3652.349466: tcp_retransmit_timer: (tcp_retransmit_timer+0x0/0xba0) net=0xff3027ef4a0f3780 sk=0xff3027eeeb1f3d80 3652.349497: inet6_csk_route_socket: (inet6_csk_xmit+0x56/0x130 <- inet6_csk_route_socket) dst=0xff3027eec766c100 3652.769469: tcp_retransmit_timer: (tcp_retransmit_timer+0x0/0xba0) net=0xff3027ef4a0f3780 sk=0xff3027eeeb1f3d80 3652.769495: inet6_csk_route_socket: (inet6_csk_xmit+0x56/0x130 <- inet6_csk_route_socket) dst=0xff3027eec766c100 3653.596135: tcp_retransmit_timer: (tcp_retransmit_timer+0x0/0xba0) net=0xff3027ef4a0f3780 sk=0xff3027eeeb1f3d80 3653.596156: inet6_csk_route_socket: (inet6_csk_xmit+0x56/0x130 <- inet6_csk_route_socket) dst=0xff3027eec766c100 3655.249465: tcp_retransmit_timer: (tcp_retransmit_timer+0x0/0xba0) net=0xff3027ef4a0f3780 sk=0xff3027eeeb1f3d80 3655.249490: inet6_csk_route_socket: (inet6_csk_xmit+0x56/0x130 <- inet6_csk_route_socket) dst=0xff3027eec766c100 3658.689463: tcp_retransmit_timer: (tcp_retransmit_timer+0x0/0xba0) net=0xff3027ef4a0f3780 sk=0xff3027eeeb1f3d80 When ip6_negative_advice() is executed the refcount is 2 - the increment made by dst_init() and the increment made by the socket: 3658.689475: ip6_negative_advice: (ip6_negative_advice+0x0/0xa0) net=0xff3027ef4a0f3780 sk=0xff3027eeeb1f3d80 dst=0xff3027eec766c100 rcuref=1 This is the result of dst_hold() and sk_dst_reset() in ip6_negative_advice(): 3658.689477: dst_release: (dst_release+0x0/0x80) dst=0xff3027eec766c100 rcuref=2 3658.689498: rt6_remove_exception_rt: (rt6_remove_exception_rt+0x0/0xa0) dst=0xff3027eec766c100 rcuref=1 3658.689501: rt6_remove_exception: (rt6_remove_exception.part.58+0x0/0xe0) dst=0xff3027eec766c100 rcuref=1 The refcount of dst 0xff3027eec766c100 is decremented by 1 as a result of removing the exception from the exception table with rt6_remove_exception_rt(): 3658.689505: dst_release: (dst_release+0x0/0x80) dst=0xff3027eec766c100 rcuref=1 The retransmissions continue without the exception dst being used for routing packets: 3662.352796: tcp_retransmit_timer: (tcp_retransmit_timer+0x0/0xba0) net=0xff3027ef00cab780 sk=0xff3027eeeb1f0000 3662.769470: tcp_retransmit_timer: (tcp_retransmit_timer+0x0/0xba0) net=0xff3027ef00cab780 sk=0xff3027eeeb1f0000 3663.596132: tcp_retransmit_timer: (tcp_retransmit_timer+0x0/0xba0) net=0xff3027ef00cab780 sk=0xff3027eeeb1f0000 The ip6_dst_destroy() function was instrumented but there was no entry for dst 0xff3027eec766c100 in the trace even long after the ns0 and ns2 net namespaces had been destroyed. The refcount made by the socket was never released. The refcount of the dst is decremented in sk_dst_reset() but that decrement is counteracted by a dst_hold() intentionally placed just before the sk_dst_reset() in ip6_negative_advice(). The probem is that sockets that keep a reference to a dst in the sk_dst_cache member increment the refcount of the dst by 1. This is apparent in the following code paths: * When ip6_route_output_flags() finds a dst that is then stored in sk->sk_dst_cache by ip6_dst_store() called from inet6_csk_route_socket() * When inet_sock_destruct() calls dst_release() for sk->sk_dst_cache Provided the dst is not kept in the sk_dst_cache member of another socket, there is no other object tied to the dst (the socket lost its reference and the dst is no longer in the exception table) and the dst becomes a leaked object after ip6_negative_advice() has finished. As a result of this dst leak, an unbalanced refcount is reported for the loopback device of a net namespace being destroyed under kernels that do not contain e5f80fcf869a ("ipv6: give an IPv6 dev to blackhole_netdev"): unregister_netdevice: waiting for lo to become free. Usage count = 2 Fix the dst leak by removing the dst_hold() in ip6_negative_advice(). The patch that introduced the dst_hold() in ip6_negative_advice() was 92f1655aa2b22 ("net: fix __dst_negative_advice() race"). But 92f1655aa2b22 merely refactored the code with regards to the dst refcount so the issue was present even before 92f1655aa2b22. The bug was introduced in 54c1a859efd9f ("ipv6: Don't drop cache route entry unless timer actually expired.") where the expired cached route is deleted, the sk_dst_cache member of the socket is set to NULL by calling dst_negative_advice() but the refcount belonging to the socket is left unbalanced. The IPv4 version - ipv4_negative_advice() - is not affected by this bug. A nexthop exception is created and the dst associated with the socket fails a check after the ICMPv6 packet indicating an MTU change has been received. When the TCP connection times out ipv4_negative_advice() merely resets the sk_dst_cache of the socket while decrementing the refcount of the exception dst. Then, the expired nexthop exception is deleted along with its routes in find_exception() while TCP tries to retransmit the same packet again. Fixes: 92f1655aa2b22 ("net: fix __dst_negative_advice() race") Fixes: 54c1a859efd9f ("ipv6: Don't drop cache route entry unless timer actually expired.") Signed-off-by: Jiri Wiesner <jwiesner@suse.de> --- net/ipv6/route.c | 3 --- 1 file changed, 3 deletions(-)