Message ID | 20240930180916.GA24637@incl (mailing list archive) |
---|---|
State | RFC |
Delegated to: | Netdev Maintainers |
Headers | show |
Series | [RFC] ipv6: route: release reference of dsts cached in sockets | expand |
On Mon, Sep 30, 2024 at 8:09 PM Jiri Wiesner <jwiesner@suse.de> wrote: > > An unbalanced refcount was reported for the loopback device of a net > namespace being destroyed: > unregister_netdevice: waiting for lo to become free. Usage count = 2 > > Analysis revealed that the IPv6 net device corresponding to the loopback > device did not get destroyed (in6_dev_finish_destroy() was not called). > The IPv6 loopback device had an unbalaced refcount: > net dev 73da100 lo refcount 1 > Operation Count Balancing Operation Count > hold __ipv6_dev_mc_inc 2 ma_put 2 > addrconf_ifdown 1 0 unbalanced > hold fib6_nh_init 2 fib6_nh_init 2 > put inet6_ifa_finish_destroy 1 ipv6_add_addr 1 > ip6_dst_destroy 90 0 unbalanced > ip6_dst_ifdown 90 0 unbalanced > hold ip6_route_dev_notify 6 ip6_route_dev_notify 6 > hold ipv6_add_addr 1 inet6_ifa_finish_destroy 1 > put ma_put 2 __ipv6_dev_mc_inc 2 > hold ndisc_netdev_event 2 ndisc_netdev_event 2 > rt6_disable_ip 1 0 unbalanced > > The refcount of addrconf_ifdown() balances the refcount increment in > ipv6_add_dev(), which had no corresponding trace entry. The > rt6_disable_ip() and ip6_dst_ifdown() entries were hold operations on the > looback device, and the ip6_dst_destroy() entries were put operations. One > refcount decrement in ip6_dst_destroy() was not executed. At this point, a > hash was implemented in the debug kernel to hold the changes of the > refcount of dst objects per namespace. The trace for the dst object that > did not decrement the IPv6 refcount of loopback follows: > > Function Parent Op Net Device Dst Refcount Diff > ip6_dst_ifdown: dst_dev_put: dst ff404b2f073da100 eth0 ff404af71ffc9c00 1 > ip6_negative_advice: tcp_retransmit_timer: dst_hold ff404b2f073da100 eth0 ff404af71ffc9c00 1 > dst_alloc: ip6_dst_alloc: dst_hold ff404b2f073da100 eth0 ff404af71ffc9c00 1 > ip6_route_output_flags: ip6_dst_lookup_tail: dst_hold ff404b2f073da100 eth0 ff404af71ffc9c00 84 > dst_release: ip6_negative_advice: dst_put ff404b2f073da100 eth0 ff404af71ffc9c00 1 > dst_release: tcp_retransmit_timer: dst_put ff404b2f073da100 eth0 ff404af71ffc9c00 20 > dst_release: inet_sock_destruct: dst_put ff404b2f073da100 eth0 ff404af71ffc9c00 29 > dst_release: __dev_queue_xmit: dst_put ff404b2f073da100 eth0 ff404af71ffc9c00 34 > dst_release: rt6_remove_exception: dst_put ffffffff9c8e2a80 blackhole_dev ff404af71ffc9c00 1 > > The ip6_dst_ifdown() trace entry was neither a hold nor a put - it merely > indicates that the net device of the dst object was changed to > blackhole_dev and the IPv6 refcount was transferred onto the loopback > device. There was no ip6_dst_destroy() trace entry, which means the dst > object was not destroyed. There were 86 hold operations but only 85 put > operations so the dst object was not destroyed because the refcount of the > dst object was unbalanced. > > The problem is that the refcount sums are ambiguous. The most probable > explanation is this: The dst object was a route for an IPv6 TCP connection > that kept timing out. Sometimes, the process closed the socket, which > corresponds to the refcount decrements of the > dst_release()/inet_sock_destruct() entries. Sometimes, the TCP retransmit > timer reset the dst of the sockets, which corresponds to the > dst_release()/tcp_retransmit_timer() entries. I am unsure about the > dst_release()/__dev_queue_xmit() entries because inet6_csk_xmit() sets > skb->_skb_refdst with SKB_DST_NOREF. > > The feature that sets the above trace apart from all the other dst traces > is the execution of ip6_negative_advice() for a cached and also expired > dst object in the exception table. The cached and expired dst object has > its refcount set to at least 2 before executing rt6_remove_exception_rt() > found in ip6_negative_advice(). One decrement happens in > rt6_remove_exception() after the dst object has been removed from the > exception table. The other decrement happens in sk_dst_reset() but that > one is counteracted by a dst_hold() intentionally placed just before the > sk_dst_reset() in ip6_negative_advice(). The probem is that a socket that > keeps a reference to a dst in its sk_dst_cache member increments the > refcount of the dst by 1. This is apparent in the following code paths: > * When ip6_route_output_flags() finds a dst that is then stored in > sk->sk_dst_cache by ip6_dst_store() called from inet6_csk_route_socket(). > * When inet_sock_destruct() calls dst_release() for sk->sk_dst_cache > Provided the dst is not kept in the sk_dst_cache member of another socket, > there is no other object tied to the dst (the socket lost its reference > and the dst is no longer in the exception table) and the dst becomes a > leaked object after ip6_negative_advice() finishes. This leak then > precludes the net namespace from being destroyed. > > The patch that introduced the dst_hold() in ip6_negative_advice() was > 92f1655aa2b22 ("net: fix __dst_negative_advice() race"). But 92f1655aa2b22 > only refactored the code with regards to the dst refcount so the issue was > present even before 92f1655aa2b22. > > Signed-off-by: Jiri Wiesner <jwiesner@suse.de> > --- > At the moment, I am sending this as an RFC because I am not able to > reproduce the issue in-house. The customer that encountered the issue is > currently running tests. For the customer's testing, I fixed the issue > with a kprobe module that calls dst_release() right after > rt6_remove_exception_rt() returns in ip6_negative_advice(), which is not > quite the same as the change proposed below. > > net/ipv6/route.c | 3 --- > 1 file changed, 3 deletions(-) > > diff --git a/net/ipv6/route.c b/net/ipv6/route.c > index b4251915585f..b70267c8d251 100644 > --- a/net/ipv6/route.c > +++ b/net/ipv6/route.c > @@ -2780,10 +2780,7 @@ static void ip6_negative_advice(struct sock *sk, > if (rt->rt6i_flags & RTF_CACHE) { > rcu_read_lock(); > if (rt6_check_expired(rt)) { > - /* counteract the dst_release() in sk_dst_reset() */ > - dst_hold(dst); > sk_dst_reset(sk); > - > rt6_remove_exception_rt(rt); > } > rcu_read_unlock(); > -- Interesting, what kernel version is your customer using ? I think that with recent kernels (after 5.18), we do not see issues because of the use of blackhole_netdev instead of loopback.
On Mon, Sep 30, 2024 at 08:27:50PM +0200, Eric Dumazet wrote: > On Mon, Sep 30, 2024 at 8:09 PM Jiri Wiesner <jwiesner@suse.de> wrote: > > > > An unbalanced refcount was reported for the loopback device of a net > > namespace being destroyed: > > unregister_netdevice: waiting for lo to become free. Usage count = 2 > > > > Analysis revealed that the IPv6 net device corresponding to the loopback > > device did not get destroyed (in6_dev_finish_destroy() was not called). > > The IPv6 loopback device had an unbalaced refcount: > > net dev 73da100 lo refcount 1 > > Operation Count Balancing Operation Count > > hold __ipv6_dev_mc_inc 2 ma_put 2 > > addrconf_ifdown 1 0 unbalanced > > hold fib6_nh_init 2 fib6_nh_init 2 > > put inet6_ifa_finish_destroy 1 ipv6_add_addr 1 > > ip6_dst_destroy 90 0 unbalanced > > ip6_dst_ifdown 90 0 unbalanced > > hold ip6_route_dev_notify 6 ip6_route_dev_notify 6 > > hold ipv6_add_addr 1 inet6_ifa_finish_destroy 1 > > put ma_put 2 __ipv6_dev_mc_inc 2 > > hold ndisc_netdev_event 2 ndisc_netdev_event 2 > > rt6_disable_ip 1 0 unbalanced > > > > The refcount of addrconf_ifdown() balances the refcount increment in > > ipv6_add_dev(), which had no corresponding trace entry. The > > rt6_disable_ip() and ip6_dst_ifdown() entries were hold operations on the > > looback device, and the ip6_dst_destroy() entries were put operations. One > > refcount decrement in ip6_dst_destroy() was not executed. At this point, a > > hash was implemented in the debug kernel to hold the changes of the > > refcount of dst objects per namespace. The trace for the dst object that > > did not decrement the IPv6 refcount of loopback follows: > > > > Function Parent Op Net Device Dst Refcount Diff > > ip6_dst_ifdown: dst_dev_put: dst ff404b2f073da100 eth0 ff404af71ffc9c00 1 > > ip6_negative_advice: tcp_retransmit_timer: dst_hold ff404b2f073da100 eth0 ff404af71ffc9c00 1 > > dst_alloc: ip6_dst_alloc: dst_hold ff404b2f073da100 eth0 ff404af71ffc9c00 1 > > ip6_route_output_flags: ip6_dst_lookup_tail: dst_hold ff404b2f073da100 eth0 ff404af71ffc9c00 84 > > dst_release: ip6_negative_advice: dst_put ff404b2f073da100 eth0 ff404af71ffc9c00 1 > > dst_release: tcp_retransmit_timer: dst_put ff404b2f073da100 eth0 ff404af71ffc9c00 20 > > dst_release: inet_sock_destruct: dst_put ff404b2f073da100 eth0 ff404af71ffc9c00 29 > > dst_release: __dev_queue_xmit: dst_put ff404b2f073da100 eth0 ff404af71ffc9c00 34 > > dst_release: rt6_remove_exception: dst_put ffffffff9c8e2a80 blackhole_dev ff404af71ffc9c00 1 > > > > The ip6_dst_ifdown() trace entry was neither a hold nor a put - it merely > > indicates that the net device of the dst object was changed to > > blackhole_dev and the IPv6 refcount was transferred onto the loopback > > device. There was no ip6_dst_destroy() trace entry, which means the dst > > object was not destroyed. There were 86 hold operations but only 85 put > > operations so the dst object was not destroyed because the refcount of the > > dst object was unbalanced. > > > > The problem is that the refcount sums are ambiguous. The most probable > > explanation is this: The dst object was a route for an IPv6 TCP connection > > that kept timing out. Sometimes, the process closed the socket, which > > corresponds to the refcount decrements of the > > dst_release()/inet_sock_destruct() entries. Sometimes, the TCP retransmit > > timer reset the dst of the sockets, which corresponds to the > > dst_release()/tcp_retransmit_timer() entries. I am unsure about the > > dst_release()/__dev_queue_xmit() entries because inet6_csk_xmit() sets > > skb->_skb_refdst with SKB_DST_NOREF. > > > > The feature that sets the above trace apart from all the other dst traces > > is the execution of ip6_negative_advice() for a cached and also expired > > dst object in the exception table. The cached and expired dst object has > > its refcount set to at least 2 before executing rt6_remove_exception_rt() > > found in ip6_negative_advice(). One decrement happens in > > rt6_remove_exception() after the dst object has been removed from the > > exception table. The other decrement happens in sk_dst_reset() but that > > one is counteracted by a dst_hold() intentionally placed just before the > > sk_dst_reset() in ip6_negative_advice(). The probem is that a socket that > > keeps a reference to a dst in its sk_dst_cache member increments the > > refcount of the dst by 1. This is apparent in the following code paths: > > * When ip6_route_output_flags() finds a dst that is then stored in > > sk->sk_dst_cache by ip6_dst_store() called from inet6_csk_route_socket(). > > * When inet_sock_destruct() calls dst_release() for sk->sk_dst_cache > > Provided the dst is not kept in the sk_dst_cache member of another socket, > > there is no other object tied to the dst (the socket lost its reference > > and the dst is no longer in the exception table) and the dst becomes a > > leaked object after ip6_negative_advice() finishes. This leak then > > precludes the net namespace from being destroyed. > > > > The patch that introduced the dst_hold() in ip6_negative_advice() was > > 92f1655aa2b22 ("net: fix __dst_negative_advice() race"). But 92f1655aa2b22 > > only refactored the code with regards to the dst refcount so the issue was > > present even before 92f1655aa2b22. > > > > Signed-off-by: Jiri Wiesner <jwiesner@suse.de> > > --- > > At the moment, I am sending this as an RFC because I am not able to > > reproduce the issue in-house. The customer that encountered the issue is > > currently running tests. For the customer's testing, I fixed the issue > > with a kprobe module that calls dst_release() right after > > rt6_remove_exception_rt() returns in ip6_negative_advice(), which is not > > quite the same as the change proposed below. > > > > net/ipv6/route.c | 3 --- > > 1 file changed, 3 deletions(-) > > > > diff --git a/net/ipv6/route.c b/net/ipv6/route.c > > index b4251915585f..b70267c8d251 100644 > > --- a/net/ipv6/route.c > > +++ b/net/ipv6/route.c > > @@ -2780,10 +2780,7 @@ static void ip6_negative_advice(struct sock *sk, > > if (rt->rt6i_flags & RTF_CACHE) { > > rcu_read_lock(); > > if (rt6_check_expired(rt)) { > > - /* counteract the dst_release() in sk_dst_reset() */ > > - dst_hold(dst); > > sk_dst_reset(sk); > > - > > rt6_remove_exception_rt(rt); > > } > > rcu_read_unlock(); > > -- > > Interesting, what kernel version is your customer using ? It is a distribution kernel (SLES 15 SP5) that is based on 5.14. The newest SLES service pack, SP6, has a 6.4-based kernel. But the customer has yet to move to that one. > I think that with recent kernels (after 5.18), we do not see issues > because of the use of blackhole_netdev > instead of loopback. Thank you very much for mentioning that. I will be backporting these patches to SUSE's 5.14-based kernels: > 9cc341286e99 ("dn_route: set rt neigh to blackhole_netdev instead of loopback_dev in ifdown") > 4d33ab08c0af ("xfrm: set dst dev to blackhole_netdev instead of loopback_dev in ifdown") > dd263a8cb194 ("ipv6: blackhole_netdev needs snmp6 counters") > e5f80fcf869a ("ipv6: give an IPv6 dev to blackhole_netdev") These patches will not resolve the dst leak but being able to destroy net namespaces at the expense of sustaining infrequently occurring dst leaks is a favourable trade. The testing results (debugging SLES15 SP5 kernel plus the kprobe module calling dst_release() in probe_ip6_negative_advice()) came back and the kernel log indicates negative overflow of the dst refcount. The first machine: > [ 9366.700330] dst_release underflow > [ 9366.700346] WARNING: CPU: 42 PID: 0 at ../net/core/dst.c:176 dst_release+0x19f/0x1b0 > [ 9366.701492] Call Trace: > [ 9366.701497] <IRQ> > [ 9366.701506] probe_ip6_negative_advice+0x12/0x30 [probe 0c57153a505ed7cc20e2c5db9522a1a524ec4f56] > [ 9366.701514] opt_pre_handler+0x3d/0x70 > [ 9366.701528] optimized_callback+0x64/0x90 > [ 9366.701538] 0xffffffffc0476032 > [ 9366.701548] ? ip6_negative_advice+0xdf/0x180 > [ 9366.701558] ? tcp_retransmit_timer+0x658/0x930 > [ 9366.701595] ? tcp_write_timer_handler+0xba/0x1f0 > [ 9366.701601] ? tcp_write_timer+0x211/0x260 > [ 9366.701624] ? call_timer_fn+0x27/0x130 > [ 9366.701631] ? run_timer_softirq+0x443/0x480 > [ 9366.701645] ? __do_softirq+0xd2/0x2c0 > [ 9366.701654] ? irq_exit_rcu+0xa4/0xc0 > [ 9366.701661] ? sysvec_apic_timer_interrupt+0x50/0x90 > [ 9366.701667] </IRQ> > [ 9366.701744] dst_release: dst:00000000ac3b9120 refcnt:-1 > [ 9366.701757] dst_release: dst:00000000ac3b9120 refcnt:-2 > [ 9382.832164] dst_release: dst:00000000ac3b9120 refcnt:-3 > [ 9659.561679] dst_release: dst:00000000494633c6 refcnt:-1 Obviously, the refcount of the cached and expired dst, dst:00000000ac3b9120, was 1 before probe_ip6_negative_advice() was called from tcp_write_timeout()/tcp_retransmit_timer(). So, my supposition that cached dsts being passed to probe_ip6_negative_advice() have a refcount of at least 2 was wrong. The second machine: [11479.272704] dst_release underflow [11479.272718] WARNING: CPU: 52 PID: 0 at ../net/core/dst.c:176 dst_release+0x19f/0x1b0 [11479.272924] Call Trace: [11479.272928] <IRQ> [11479.272932] tcp_retransmit_timer+0x47c/0x930 [11479.272953] tcp_write_timer_handler+0xba/0x1f0 [11479.272956] tcp_write_timer+0x211/0x260 [11479.272962] call_timer_fn+0x27/0x130 [11479.272966] run_timer_softirq+0x443/0x480 [11479.272973] __do_softirq+0xd2/0x2c0 [11479.272980] irq_exit_rcu+0xa4/0xc0 [11479.272985] sysvec_apic_timer_interrupt+0x50/0x90 [11479.272989] </IRQ> [11479.273044] dst_release: dst:00000000ff7e6845 refcnt:-1 [11681.380331] dst_release: dst:00000000893d8a2d refcnt:-1 In this case, the refcount of dst:00000000ff7e6845 was have been 2 before probe_ip6_negative_advice() so the dst_release() called from the kprobe handler decremented the refcount to 0. No warning was printed. There probably was another socket having the dst stored in sk_dst_cache, to which the refcount of 1 belonged. When the second __sk_dst_reset() was called in tcp_retransmit_timer(), which was the tcp_retransmit_timer+0x47c site, the refcount dropped below 0. I am afraid this patch is misguided. I would still like to find the source of the dst leak but I am also running out of time which the customer is willing to invest into investigating this issue.
On Tue, Oct 1, 2024 at 11:26 AM Jiri Wiesner <jwiesner@suse.de> wrote: > > On Mon, Sep 30, 2024 at 08:27:50PM +0200, Eric Dumazet wrote: > > On Mon, Sep 30, 2024 at 8:09 PM Jiri Wiesner <jwiesner@suse.de> wrote: > > > > > > An unbalanced refcount was reported for the loopback device of a net > > > namespace being destroyed: > > > unregister_netdevice: waiting for lo to become free. Usage count = 2 > > > > > > Analysis revealed that the IPv6 net device corresponding to the loopback > > > device did not get destroyed (in6_dev_finish_destroy() was not called). > > > The IPv6 loopback device had an unbalaced refcount: > > > net dev 73da100 lo refcount 1 > > > Operation Count Balancing Operation Count > > > hold __ipv6_dev_mc_inc 2 ma_put 2 > > > addrconf_ifdown 1 0 unbalanced > > > hold fib6_nh_init 2 fib6_nh_init 2 > > > put inet6_ifa_finish_destroy 1 ipv6_add_addr 1 > > > ip6_dst_destroy 90 0 unbalanced > > > ip6_dst_ifdown 90 0 unbalanced > > > hold ip6_route_dev_notify 6 ip6_route_dev_notify 6 > > > hold ipv6_add_addr 1 inet6_ifa_finish_destroy 1 > > > put ma_put 2 __ipv6_dev_mc_inc 2 > > > hold ndisc_netdev_event 2 ndisc_netdev_event 2 > > > rt6_disable_ip 1 0 unbalanced > > > > > > The refcount of addrconf_ifdown() balances the refcount increment in > > > ipv6_add_dev(), which had no corresponding trace entry. The > > > rt6_disable_ip() and ip6_dst_ifdown() entries were hold operations on the > > > looback device, and the ip6_dst_destroy() entries were put operations. One > > > refcount decrement in ip6_dst_destroy() was not executed. At this point, a > > > hash was implemented in the debug kernel to hold the changes of the > > > refcount of dst objects per namespace. The trace for the dst object that > > > did not decrement the IPv6 refcount of loopback follows: > > > > > > Function Parent Op Net Device Dst Refcount Diff > > > ip6_dst_ifdown: dst_dev_put: dst ff404b2f073da100 eth0 ff404af71ffc9c00 1 > > > ip6_negative_advice: tcp_retransmit_timer: dst_hold ff404b2f073da100 eth0 ff404af71ffc9c00 1 > > > dst_alloc: ip6_dst_alloc: dst_hold ff404b2f073da100 eth0 ff404af71ffc9c00 1 > > > ip6_route_output_flags: ip6_dst_lookup_tail: dst_hold ff404b2f073da100 eth0 ff404af71ffc9c00 84 > > > dst_release: ip6_negative_advice: dst_put ff404b2f073da100 eth0 ff404af71ffc9c00 1 > > > dst_release: tcp_retransmit_timer: dst_put ff404b2f073da100 eth0 ff404af71ffc9c00 20 > > > dst_release: inet_sock_destruct: dst_put ff404b2f073da100 eth0 ff404af71ffc9c00 29 > > > dst_release: __dev_queue_xmit: dst_put ff404b2f073da100 eth0 ff404af71ffc9c00 34 > > > dst_release: rt6_remove_exception: dst_put ffffffff9c8e2a80 blackhole_dev ff404af71ffc9c00 1 > > > > > > The ip6_dst_ifdown() trace entry was neither a hold nor a put - it merely > > > indicates that the net device of the dst object was changed to > > > blackhole_dev and the IPv6 refcount was transferred onto the loopback > > > device. There was no ip6_dst_destroy() trace entry, which means the dst > > > object was not destroyed. There were 86 hold operations but only 85 put > > > operations so the dst object was not destroyed because the refcount of the > > > dst object was unbalanced. > > > > > > The problem is that the refcount sums are ambiguous. The most probable > > > explanation is this: The dst object was a route for an IPv6 TCP connection > > > that kept timing out. Sometimes, the process closed the socket, which > > > corresponds to the refcount decrements of the > > > dst_release()/inet_sock_destruct() entries. Sometimes, the TCP retransmit > > > timer reset the dst of the sockets, which corresponds to the > > > dst_release()/tcp_retransmit_timer() entries. I am unsure about the > > > dst_release()/__dev_queue_xmit() entries because inet6_csk_xmit() sets > > > skb->_skb_refdst with SKB_DST_NOREF. > > > > > > The feature that sets the above trace apart from all the other dst traces > > > is the execution of ip6_negative_advice() for a cached and also expired > > > dst object in the exception table. The cached and expired dst object has > > > its refcount set to at least 2 before executing rt6_remove_exception_rt() > > > found in ip6_negative_advice(). One decrement happens in > > > rt6_remove_exception() after the dst object has been removed from the > > > exception table. The other decrement happens in sk_dst_reset() but that > > > one is counteracted by a dst_hold() intentionally placed just before the > > > sk_dst_reset() in ip6_negative_advice(). The probem is that a socket that > > > keeps a reference to a dst in its sk_dst_cache member increments the > > > refcount of the dst by 1. This is apparent in the following code paths: > > > * When ip6_route_output_flags() finds a dst that is then stored in > > > sk->sk_dst_cache by ip6_dst_store() called from inet6_csk_route_socket(). > > > * When inet_sock_destruct() calls dst_release() for sk->sk_dst_cache > > > Provided the dst is not kept in the sk_dst_cache member of another socket, > > > there is no other object tied to the dst (the socket lost its reference > > > and the dst is no longer in the exception table) and the dst becomes a > > > leaked object after ip6_negative_advice() finishes. This leak then > > > precludes the net namespace from being destroyed. > > > > > > The patch that introduced the dst_hold() in ip6_negative_advice() was > > > 92f1655aa2b22 ("net: fix __dst_negative_advice() race"). But 92f1655aa2b22 > > > only refactored the code with regards to the dst refcount so the issue was > > > present even before 92f1655aa2b22. > > > > > > Signed-off-by: Jiri Wiesner <jwiesner@suse.de> > > > --- > > > At the moment, I am sending this as an RFC because I am not able to > > > reproduce the issue in-house. The customer that encountered the issue is > > > currently running tests. For the customer's testing, I fixed the issue > > > with a kprobe module that calls dst_release() right after > > > rt6_remove_exception_rt() returns in ip6_negative_advice(), which is not > > > quite the same as the change proposed below. > > > > > > net/ipv6/route.c | 3 --- > > > 1 file changed, 3 deletions(-) > > > > > > diff --git a/net/ipv6/route.c b/net/ipv6/route.c > > > index b4251915585f..b70267c8d251 100644 > > > --- a/net/ipv6/route.c > > > +++ b/net/ipv6/route.c > > > @@ -2780,10 +2780,7 @@ static void ip6_negative_advice(struct sock *sk, > > > if (rt->rt6i_flags & RTF_CACHE) { > > > rcu_read_lock(); > > > if (rt6_check_expired(rt)) { > > > - /* counteract the dst_release() in sk_dst_reset() */ > > > - dst_hold(dst); > > > sk_dst_reset(sk); > > > - > > > rt6_remove_exception_rt(rt); > > > } > > > rcu_read_unlock(); > > > -- > > > > Interesting, what kernel version is your customer using ? > > It is a distribution kernel (SLES 15 SP5) that is based on 5.14. The newest SLES service pack, SP6, has a 6.4-based kernel. But the customer has yet to move to that one. > > > I think that with recent kernels (after 5.18), we do not see issues > > because of the use of blackhole_netdev > > instead of loopback. > > Thank you very much for mentioning that. I will be backporting these patches to SUSE's 5.14-based kernels: > > 9cc341286e99 ("dn_route: set rt neigh to blackhole_netdev instead of loopback_dev in ifdown") > > 4d33ab08c0af ("xfrm: set dst dev to blackhole_netdev instead of loopback_dev in ifdown") > > dd263a8cb194 ("ipv6: blackhole_netdev needs snmp6 counters") > > e5f80fcf869a ("ipv6: give an IPv6 dev to blackhole_netdev") > These patches will not resolve the dst leak but being able to destroy net namespaces at the expense of sustaining infrequently occurring dst leaks is a favourable trade. > > The testing results (debugging SLES15 SP5 kernel plus the kprobe module calling dst_release() in probe_ip6_negative_advice()) came back and the kernel log indicates negative overflow of the dst refcount. The first machine: > > [ 9366.700330] dst_release underflow > > [ 9366.700346] WARNING: CPU: 42 PID: 0 at ../net/core/dst.c:176 dst_release+0x19f/0x1b0 > > [ 9366.701492] Call Trace: > > [ 9366.701497] <IRQ> > > [ 9366.701506] probe_ip6_negative_advice+0x12/0x30 [probe 0c57153a505ed7cc20e2c5db9522a1a524ec4f56] > > [ 9366.701514] opt_pre_handler+0x3d/0x70 > > [ 9366.701528] optimized_callback+0x64/0x90 > > [ 9366.701538] 0xffffffffc0476032 > > [ 9366.701548] ? ip6_negative_advice+0xdf/0x180 > > [ 9366.701558] ? tcp_retransmit_timer+0x658/0x930 > > [ 9366.701595] ? tcp_write_timer_handler+0xba/0x1f0 > > [ 9366.701601] ? tcp_write_timer+0x211/0x260 > > [ 9366.701624] ? call_timer_fn+0x27/0x130 > > [ 9366.701631] ? run_timer_softirq+0x443/0x480 > > [ 9366.701645] ? __do_softirq+0xd2/0x2c0 > > [ 9366.701654] ? irq_exit_rcu+0xa4/0xc0 > > [ 9366.701661] ? sysvec_apic_timer_interrupt+0x50/0x90 > > [ 9366.701667] </IRQ> > > [ 9366.701744] dst_release: dst:00000000ac3b9120 refcnt:-1 > > [ 9366.701757] dst_release: dst:00000000ac3b9120 refcnt:-2 > > [ 9382.832164] dst_release: dst:00000000ac3b9120 refcnt:-3 > > [ 9659.561679] dst_release: dst:00000000494633c6 refcnt:-1 > Obviously, the refcount of the cached and expired dst, dst:00000000ac3b9120, was 1 before probe_ip6_negative_advice() was called from tcp_write_timeout()/tcp_retransmit_timer(). So, my supposition that cached dsts being passed to probe_ip6_negative_advice() have a refcount of at least 2 was wrong. > > The second machine: > [11479.272704] dst_release underflow > [11479.272718] WARNING: CPU: 52 PID: 0 at ../net/core/dst.c:176 dst_release+0x19f/0x1b0 > [11479.272924] Call Trace: > [11479.272928] <IRQ> > [11479.272932] tcp_retransmit_timer+0x47c/0x930 > [11479.272953] tcp_write_timer_handler+0xba/0x1f0 > [11479.272956] tcp_write_timer+0x211/0x260 > [11479.272962] call_timer_fn+0x27/0x130 > [11479.272966] run_timer_softirq+0x443/0x480 > [11479.272973] __do_softirq+0xd2/0x2c0 > [11479.272980] irq_exit_rcu+0xa4/0xc0 > [11479.272985] sysvec_apic_timer_interrupt+0x50/0x90 > [11479.272989] </IRQ> > [11479.273044] dst_release: dst:00000000ff7e6845 refcnt:-1 > [11681.380331] dst_release: dst:00000000893d8a2d refcnt:-1 > In this case, the refcount of dst:00000000ff7e6845 was have been 2 before probe_ip6_negative_advice() so the dst_release() called from the kprobe handler decremented the refcount to 0. No warning was printed. There probably was another socket having the dst stored in sk_dst_cache, to which the refcount of 1 belonged. When the second __sk_dst_reset() was called in tcp_retransmit_timer(), which was the tcp_retransmit_timer+0x47c site, the refcount dropped below 0. > > I am afraid this patch is misguided. I would still like to find the source of the dst leak but I am also running out of time which the customer is willing to invest into investigating this issue. Is your kernel including this commit? commit 28044fc1d4953b07acec0da4d2fc4784c57ea6fb Author: Joanne Koong <joannelkoong@gmail.com> Date: Mon Aug 22 11:10:21 2022 -0700 net: Add a bhash2 table hashed by port and address After this commit, it seems in tcp_v6_connect(), the 'goto failure' may cause a dst leak.: dst = ip6_dst_lookup_flow(net, sk, &fl6, final_p); ... if (!saddr) { saddr = &fl6.saddr; err = inet_bhash2_update_saddr(sk, saddr, AF_INET6); if (err) goto failure; <--- } ... ip6_dst_store(sk, dst, NULL, NULL);
On Wed, Oct 02, 2024 at 04:27:55PM -0400, Xin Long wrote: > On Tue, Oct 1, 2024 at 11:26 AM Jiri Wiesner <jwiesner@suse.de> wrote: > > I am afraid this patch is misguided. I would still like to find the source of the dst leak but I am also running out of time which the customer is willing to invest into investigating this issue. > Is your kernel including this commit? > > commit 28044fc1d4953b07acec0da4d2fc4784c57ea6fb > Author: Joanne Koong <joannelkoong@gmail.com> > Date: Mon Aug 22 11:10:21 2022 -0700 > > net: Add a bhash2 table hashed by port and address > > After this commit, it seems in tcp_v6_connect(), the 'goto failure' > may cause a dst leak.: > > dst = ip6_dst_lookup_flow(net, sk, &fl6, final_p); > ... > if (!saddr) { > saddr = &fl6.saddr; > > err = inet_bhash2_update_saddr(sk, saddr, AF_INET6); > if (err) > goto failure; <--- > } > ... > ip6_dst_store(sk, dst, NULL, NULL); Thanks for pointing this out. 28044fc1d495 seems to be an interesting commit as far as the number of Fixes is concerned. The commit was not backported to the 5.14-based SLES kernels, for which the unbalaced refcount bug was reported. The commit is part of the 6.4-based SLES kernels so I will have to see if all the patches with Fixes tags have been backported. J.
On Thu, Oct 3, 2024 at 1:01 PM Jiri Wiesner <jwiesner@suse.de> wrote: > > On Wed, Oct 02, 2024 at 04:27:55PM -0400, Xin Long wrote: > > On Tue, Oct 1, 2024 at 11:26 AM Jiri Wiesner <jwiesner@suse.de> wrote: > > > I am afraid this patch is misguided. I would still like to find the source of the dst leak but I am also running out of time which the customer is willing to invest into investigating this issue. > > Is your kernel including this commit? > > > > commit 28044fc1d4953b07acec0da4d2fc4784c57ea6fb > > Author: Joanne Koong <joannelkoong@gmail.com> > > Date: Mon Aug 22 11:10:21 2022 -0700 > > > > net: Add a bhash2 table hashed by port and address > > > > After this commit, it seems in tcp_v6_connect(), the 'goto failure' > > may cause a dst leak.: > > > > dst = ip6_dst_lookup_flow(net, sk, &fl6, final_p); > > ... > > if (!saddr) { > > saddr = &fl6.saddr; > > > > err = inet_bhash2_update_saddr(sk, saddr, AF_INET6); > > if (err) > > goto failure; <--- > > } > > ... > > ip6_dst_store(sk, dst, NULL, NULL); > > Thanks for pointing this out. 28044fc1d495 seems to be an interesting commit as far as the number of Fixes is concerned. The commit was not backported to the 5.14-based SLES kernels, for which the unbalaced refcount bug was reported. The commit is part of the 6.4-based SLES kernels so I will have to see if all the patches with Fixes tags have been backported. > J. Hi, Jiri, We recently also encountered this 'unregister_netdevice: waiting for lo to become free. Usage count = X' problem on our customer env after backporting Commit 92f1655aa2b22 ("net: fix __dst_negative_advice() race"). [1] The commit looks correct to me, so I guess it may uncover some existing issues. As it took a very long time to get reproduced on our customer env, which made it impossible to debug. Also the issue existed even after disabling IPv6. It seems much easier to reproduce it on your customer env. So I'm wondering - Was the testing on your customer env related to IPv6 ? - Does the issue still exist after reverting the commit [1] ? Thanks.
On Sun, Oct 06, 2024 at 02:25:25PM -0400, Xin Long wrote: > We recently also encountered this > > 'unregister_netdevice: waiting for lo to become free. Usage count = X' > > problem on our customer env after backporting > > Commit 92f1655aa2b22 ("net: fix __dst_negative_advice() race"). [1] > > The commit looks correct to me, so I guess it may uncover some existing > issues. That is interesting. It seems possible to me. > As it took a very long time to get reproduced on our customer env, which > made it impossible to debug. Also the issue existed even after > disabling IPv6. AFAIK, the reproducibility of the issue hinges on net namespaces being created and destroyed often and TCP connections timing out in the namespaces. Traffic rate does not seem important. > It seems much easier to reproduce it on your customer env. So I'm wondering It takes roughly 24 hours but sometimes much more or much less to reproduce this. > - Was the testing on your customer env related to IPv6 ? Yes, it was. But I don't think the customer can disable IPv6. Their product is based on IPv6. > - Does the issue still exist after reverting the commit [1] ? I will ask the customer for a test. Also, the issue is not reproducible after e5f80fcf869a ("ipv6: give an IPv6 dev to blackhole_netdev") was backported. J.
On Sun, Oct 06, 2024 at 02:25:25PM -0400, Xin Long wrote: > We recently also encountered this > > 'unregister_netdevice: waiting for lo to become free. Usage count = X' > > problem on our customer env after backporting > > Commit 92f1655aa2b22 ("net: fix __dst_negative_advice() race"). [1] > > The commit looks correct to me, so I guess it may uncover some existing > issues. > > As it took a very long time to get reproduced on our customer env, which > made it impossible to debug. Also the issue existed even after > disabling IPv6. > > It seems much easier to reproduce it on your customer env. So I'm wondering > > - Was the testing on your customer env related to IPv6 ? > - Does the issue still exist after reverting the commit [1] ? The customer tried reproducing the issue with 92f1655aa2b22 ("net: fix __dst_negative_advice() race") reverted and the issue appeared again. My next step was capturing more points within the stacktraces when dst refcounts are changed (I do not have full stack traces - the trace entries I store in the hash contain only the instruction pointer and the parent function). This is the trace for the leaked dst object: > dst ff1c4157413bf900 > alloc: 1 destroy: obj diff: 1 > hold ops: 5 put ops: 4 refcnt diff: 1 > Function Parent Op Net Device Dst Number of Calls > inet6_csk_route_socket+0x1c2/0x2d0: inet6_csk_update_pmtu+0x58/0x90: dst ff1c4151bcac1080 eth0 ff1c4157413bf900 1 > __ip6_rt_update_pmtu+0x183/0x3c0: inet6_csk_update_pmtu+0x4b/0x90: dst ff1c4151bcac1080 eth0 ff1c4157413bf900 1 > ip6_dst_lookup_flow+0x4f/0x1d0: inet6_csk_route_socket+0x198/0x2d0: dst ff1c4151bcac1080 eth0 ff1c4157413bf900 1 > ip6_dst_ifdown+0x5/0x250: dst_dev_put+0x2d/0xd0: dst ff1c4151bcac1080 eth0 ff1c4157413bf900 1 > tcp_v6_connect+0x37e/0x790: __inet_stream_connect+0x2c8/0x3a0: dst ff1c4151bcac1080 eth0 ff1c4157413bf900 2 > ip6_dst_lookup_flow+0x4f/0x1d0: tcp_v6_connect+0x320/0x790: dst ff1c4151bcac1080 eth0 ff1c4157413bf900 2 > __dev_queue_xmit+0x1af/0xd20: ip6_finish_output2+0x1f1/0x6e0: dst ff1c4151bcac1080 eth0 ff1c4157413bf900 675 > inet6_csk_xmit+0xa1/0x150: __tcp_transmit_skb+0x5f8/0xd40: dst ff1c4151bcac1080 eth0 ff1c4157413bf900 675 > Function Parent Op Net Device Dst Refcount Diff > ip6_route_output_flags+0x76/0x230: ip6_dst_lookup_tail+0x215/0x250: dst_hold ff1c4151bcac1080 eth0 ff1c4157413bf900 1 > ip6_negative_advice+0x66/0x2d0: tcp_retransmit_timer+0x5ed/0xac0: dst_hold ff1c4151bcac1080 eth0 ff1c4157413bf900 1 > dst_alloc+0x5e/0x180: ip6_dst_alloc+0x27/0x60: dst_hold ff1c4151bcac1080 eth0 ff1c4157413bf900 1 > ip6_route_output_flags+0x76/0x230: ip6_dst_lookup_tail+0x10f/0x250: dst_hold ff1c4151bcac1080 eth0 ff1c4157413bf900 2 > dst_release+0x32/0x140: ip6_negative_advice+0x137/0x2d0: dst_put ff1c4151bcac1080 eth0 ff1c4157413bf900 1 > dst_release+0x32/0x140: inet_sock_destruct+0x146/0x1c0: dst_put ff1c4151bcac1080 eth0 ff1c4157413bf900 2 > dst_release+0x32/0x140: rt6_remove_exception.part.53+0x7f/0xe0: dst_put ffffffff90ee2a80 blackhole_dev ff1c4157413bf900 1 > Function Parent Op Net Dst Socket Number of Calls > sk_setup_caps+0x5/0x200: tcp_v6_connect+0x428/0x790: sk_dst ff1c4151bcac1080 ff1c4157413bf900 ff1c4158c07aaf80 1 > sk_setup_caps+0x5/0x200: tcp_v6_connect+0x428/0x790: sk_dst ff1c4151bcac1080 ff1c4157413bf900 ff1c418e49bac280 1 > sk_setup_caps+0x5/0x200: inet6_csk_route_socket+0x23c/0x2d0: sk_dst ff1c4151bcac1080 ff1c4157413bf900 ff1c418ea04f7200 1 The refcount of the dst was changed only when a dst_hold or dst_put operation was logged. The hold in dst_alloc() is balanced by the dst_release() in rt6_remove_exception(). The three holds in ip6_route_output_flags() belonged to the references held by sockets which were balanced by the dst_release() in inet_sock_destruct(). But that leaves one hold operation outstanding and unbalanced. There is a put operation for expired cached dst objects in ip6_negative_advice() but that one is intentionally balanced by a hold in the same function. Accoding to the sk_dst trace entries, three sockets held a reference to the leaked dst. All of these sockets were eventually distroyed so there was no socket leak causing the dst leak, which rules out one of the possbile hypotheses. The dst was stored in the socket object while a connect() sycall was being executed for two of the sockets. An IMCPv6 packet carrying a Packet Too Big message was received by the third socket, which lead to the execution of routines changing the MTU for the dst and storing a reference to the dst in the socket object. As for races, sockets are locked in inet_stream_connect, tcp_write_timer() and tcp_v6_err(), and the mutual exclusion prevents races when reference to dst objects are set and reset for sockets. The put operation executed from dst_release() in inet_sock_destruct() is not part of a critical section protected by mutual exclusion but it is guaranteed to run only after the refcount of the socket has reached zero. All the other code paths in the trace increment the refcount of the socket while they manipulate dsts (the connect() syscall uses the refcount incremented by the socket() syscall). This is a trace for the socket for which ip6_negative_advice() was executed for an expired cached dst: > socket ff1c4158c07aaf80 > alloc: 1 destroy: 1 obj diff: 0 > hold ops: 9 put ops: 9 refcnt diff: 0 > ip6_negative_advice+0xcd/0x2d0: tcp_retransmit_timer+0x5ed/0xac0: sk_dst ff1c4151bcac1080 0 ff1c4158c07aaf80 1 > tcp_retransmit_timer+0x50d/0xac0: tcp_write_timer_handler+0xba/0x1f0: sk_dst ff1c4151bcac1080 0 ff1c4158c07aaf80 2 > ip6_negative_advice+0x169/0x2d0: tcp_retransmit_timer+0x5ed/0xac0: sk_dst ff1c4151bcac1080 0 ff1c4158c07aaf80 3 > sk_setup_caps+0x5/0x200: inet6_sk_rebuild_header+0x1f9/0x2a0: sk_dst ff1c4151bcac1080 ff1c4153a762d500 ff1c4158c07aaf80 5 > sk_setup_caps+0x5/0x200: tcp_v6_connect+0x428/0x790: sk_dst ff1c4151bcac1080 ff1c4157413bf900 ff1c4158c07aaf80 1 > sk_free+0x5/0x100: tcp_close+0x100/0x120: sk ff1c4151bcac1080 0 ff1c4158c07aaf80 1 > __sk_free+0x5/0x190: sk_free+0xe3/0x100: sk ff1c4151bcac1080 0 ff1c4158c07aaf80 1 > sk_destruct+0x5/0x150: __sk_free+0x83/0x190: sk ff1c4151bcac1080 0 ff1c4158c07aaf80 1 > sk_alloc+0x195/0x2a0: inet6_create+0xd5/0x450: sk ff1c4151bcac1080 0 ff1c4158c07aaf80 1 > sk_free+0x5/0x100: ip6_rcv_core.isra.25+0x27c/0x440: sk ff1c4151bcac1080 0 ff1c4158c07aaf80 7 > sk_reset_timer+0x7b/0x130: tcp_connect+0x83f/0xe00: sk_hold ff1c4151bcac1080 0 ff1c4158c07aaf80 1 > __tcp_close+0x15d/0x4c0: tcp_close+0x35/0x120: sk_hold ff1c4151bcac1080 0 ff1c4158c07aaf80 1 > sock_init_data_uid+0x107/0x2d0: inet6_create+0xec/0x450: sk_hold ff1c4151bcac1080 0 ff1c4158c07aaf80 1 > sk_reset_timer+0x7b/0x130: tcp_retransmit_timer+0x4e3/0xac0: sk_hold ff1c4151bcac1080 0 ff1c4158c07aaf80 6 > tcp_close+0x5/0x120: inet_release+0x3c/0x80: sk_put ff1c4151bcac1080 0 ff1c4158c07aaf80 1 > inet_csk_destroy_sock+0x90/0x1e0: __tcp_close+0x2ca/0x4c0: sk_put ff1c4151bcac1080 0 ff1c4158c07aaf80 1 > tcp_write_timer+0x61/0x260: call_timer_fn+0x27/0x130: sk_put ff1c4151bcac1080 0 ff1c4158c07aaf80 7 Besides dst ff1c4157413bf900, this sockets held a reference to another dst - ff1c4153a762d500. Regardless of the order in which the dst references were held, the extra dst_hold() in ip6_negative_advice() would preserve the refcount increment owned by the socket while also resetting the sk_dst_cache member of the socket to NULL, rendering it impossible for the socket to decrement the refcount in the future. We see that the socket was destroyed but the sk_dst_cache member was most probably NULL at the time because all the dst references were reset in tcp_retransmit_timer() and ip6_negative_advice(). Based on this data, I went back to recheck the kprobe module that was meant to implement a dst_release() in ip6_negative_advice(). I found a mistake in the module that resulted in executing: > static void ip6_negative_advice(struct sock *sk, struct dst_entry *dst) > { > struct rt6_info *rt = (struct rt6_info *) dst; > if (rt->rt6i_flags & RTF_CACHE) { > rcu_read_lock(); > if (rt6_check_expired(rt)) { > /* counteract the dst_release() in sk_dst_reset() */ > dst_hold(dst); > sk_dst_reset(sk); > rt6_remove_exception_rt(rt); > } > ---> kprobe_calls_dst_release; > rcu_read_unlock(); > return; > } > sk_dst_reset(sk); > } instead of this: > static void ip6_negative_advice(struct sock *sk, struct dst_entry *dst) > { > struct rt6_info *rt = (struct rt6_info *) dst; > if (rt->rt6i_flags & RTF_CACHE) { > rcu_read_lock(); > if (rt6_check_expired(rt)) { > /* counteract the dst_release() in sk_dst_reset() */ > dst_hold(dst); > ---> kprobe_calls_dst_release; > sk_dst_reset(sk); > rt6_remove_exception_rt(rt); > } > rcu_read_unlock(); > return; > } > sk_dst_reset(sk); > } which clearly shows why the test with the module resulted in negative overflow of dst refcounts. I fixed the module and further testing showed the issue is no longer reproducible. I have managed to put together a minimal set of steps needed to reproduce the issue: ip link add veth1 mtu 65535 type veth peer veth0 mtu 65535 ip netns add ns0 ip link set veth1 netns ns0 ip addr add fd00::1/24 dev veth0 ip -n ns0 addr add fd00::2/24 dev veth1 ip link set up dev veth0 ip -n ns0 link set up dev lo ip -n ns0 link set up dev veth1 ip -n ns0 route add default via fd00::1 dev veth1 ip link add veth3 mtu 65535 type veth peer veth2 mtu 65535 ip netns add ns2 ip link set veth3 netns ns2 ip addr add fd02::1/24 dev veth2 ip -n ns2 addr add fd02::2/24 dev veth3 ip link set up dev veth2 ip -n ns2 link set up dev lo ip -n ns2 link set up dev veth3 ip -n ns2 route add default via fd02::1 dev veth3 ip netns exec ns0 bash -c "echo 6 > /proc/sys/net/ipv6/route/mtu_expires" #ip netns exec ns2 bash -c "echo 6 > /proc/sys/net/ipv6/route/mtu_expires" ip netns exec ns0 bash -c "echo 900 > /proc/sys/net/ipv6/route/gc_interval" #ip netns exec ns2 bash -c "echo 900 > /proc/sys/net/ipv6/route/gc_interval" sleep 30 ip6tables -F ip6tables -A FORWARD -i veth0 -d fd02::/24 -j ACCEPT ip6tables -A FORWARD -i veth2 -d fd00::/24 -j ACCEPT echo 1 > /proc/sys/net/ipv6/conf/all/forwarding ip6tables -L -v (ip netns exec ns2 netcat -6 -l -s fd02::2 -p 1234 &> /dev/null) & serv=$! sleep 1 dd if=/dev/zero bs=1M count=100 | ip netns exec ns0 netcat -6 fd02::2 1234 & clnt=$! sleep 1 kill $clnt $serv wait $serv (ip netns exec ns2 netcat -6 -l -s fd02::2 -p 1234 &> /dev/null) & serv=$! sleep 1 dd if=/dev/zero bs=1M | ip netns exec ns0 netcat -6 fd02::2 1234 & clnt=$! sleep 1 ip link set veth2 mtu 2000 sleep 1 ip6tables -D FORWARD -i veth2 -d fd00::/24 -j ACCEPT ip6tables -A FORWARD -i veth2 -d fd00::/24 -j DROP sleep 10 kill $clnt $serv wait $serv ip -n ns0 link set down dev lo ip -n ns0 link set down dev veth1 ip -n ns0 link delete dev veth1 ip netns delete ns0 ip -n ns2 link set down dev lo ip -n ns2 link set down dev veth3 ip -n ns2 link delete dev veth3 ip netns delete ns2 I was able to reproduce the dst leak under 6.12-rc7. I will submit a fix shortly.
On Tue, Nov 12, 2024 at 3:57 AM Jiri Wiesner <jwiesner@suse.de> wrote: > > On Sun, Oct 06, 2024 at 02:25:25PM -0400, Xin Long wrote: > > We recently also encountered this > > > > 'unregister_netdevice: waiting for lo to become free. Usage count = X' > > > > problem on our customer env after backporting > > > > Commit 92f1655aa2b22 ("net: fix __dst_negative_advice() race"). [1] > > > > The commit looks correct to me, so I guess it may uncover some existing > > issues. > > > > As it took a very long time to get reproduced on our customer env, which > > made it impossible to debug. Also the issue existed even after > > disabling IPv6. > > > > It seems much easier to reproduce it on your customer env. So I'm wondering > > > > - Was the testing on your customer env related to IPv6 ? > > - Does the issue still exist after reverting the commit [1] ? > > The customer tried reproducing the issue with 92f1655aa2b22 ("net: fix __dst_negative_advice() race") reverted and the issue appeared again. My next step was capturing more points within the stacktraces when dst refcounts are changed (I do not have full stack traces - the trace entries I store in the hash contain only the instruction pointer and the parent function). This is the trace for the leaked dst object: > > dst ff1c4157413bf900 > > alloc: 1 destroy: obj diff: 1 > > hold ops: 5 put ops: 4 refcnt diff: 1 > > Function Parent Op Net Device Dst Number of Calls > > inet6_csk_route_socket+0x1c2/0x2d0: inet6_csk_update_pmtu+0x58/0x90: dst ff1c4151bcac1080 eth0 ff1c4157413bf900 1 > > __ip6_rt_update_pmtu+0x183/0x3c0: inet6_csk_update_pmtu+0x4b/0x90: dst ff1c4151bcac1080 eth0 ff1c4157413bf900 1 > > ip6_dst_lookup_flow+0x4f/0x1d0: inet6_csk_route_socket+0x198/0x2d0: dst ff1c4151bcac1080 eth0 ff1c4157413bf900 1 > > ip6_dst_ifdown+0x5/0x250: dst_dev_put+0x2d/0xd0: dst ff1c4151bcac1080 eth0 ff1c4157413bf900 1 > > tcp_v6_connect+0x37e/0x790: __inet_stream_connect+0x2c8/0x3a0: dst ff1c4151bcac1080 eth0 ff1c4157413bf900 2 > > ip6_dst_lookup_flow+0x4f/0x1d0: tcp_v6_connect+0x320/0x790: dst ff1c4151bcac1080 eth0 ff1c4157413bf900 2 > > __dev_queue_xmit+0x1af/0xd20: ip6_finish_output2+0x1f1/0x6e0: dst ff1c4151bcac1080 eth0 ff1c4157413bf900 675 > > inet6_csk_xmit+0xa1/0x150: __tcp_transmit_skb+0x5f8/0xd40: dst ff1c4151bcac1080 eth0 ff1c4157413bf900 675 > > Function Parent Op Net Device Dst Refcount Diff > > ip6_route_output_flags+0x76/0x230: ip6_dst_lookup_tail+0x215/0x250: dst_hold ff1c4151bcac1080 eth0 ff1c4157413bf900 1 > > ip6_negative_advice+0x66/0x2d0: tcp_retransmit_timer+0x5ed/0xac0: dst_hold ff1c4151bcac1080 eth0 ff1c4157413bf900 1 > > dst_alloc+0x5e/0x180: ip6_dst_alloc+0x27/0x60: dst_hold ff1c4151bcac1080 eth0 ff1c4157413bf900 1 > > ip6_route_output_flags+0x76/0x230: ip6_dst_lookup_tail+0x10f/0x250: dst_hold ff1c4151bcac1080 eth0 ff1c4157413bf900 2 > > dst_release+0x32/0x140: ip6_negative_advice+0x137/0x2d0: dst_put ff1c4151bcac1080 eth0 ff1c4157413bf900 1 > > dst_release+0x32/0x140: inet_sock_destruct+0x146/0x1c0: dst_put ff1c4151bcac1080 eth0 ff1c4157413bf900 2 > > dst_release+0x32/0x140: rt6_remove_exception.part.53+0x7f/0xe0: dst_put ffffffff90ee2a80 blackhole_dev ff1c4157413bf900 1 > > Function Parent Op Net Dst Socket Number of Calls > > sk_setup_caps+0x5/0x200: tcp_v6_connect+0x428/0x790: sk_dst ff1c4151bcac1080 ff1c4157413bf900 ff1c4158c07aaf80 1 > > sk_setup_caps+0x5/0x200: tcp_v6_connect+0x428/0x790: sk_dst ff1c4151bcac1080 ff1c4157413bf900 ff1c418e49bac280 1 > > sk_setup_caps+0x5/0x200: inet6_csk_route_socket+0x23c/0x2d0: sk_dst ff1c4151bcac1080 ff1c4157413bf900 ff1c418ea04f7200 1 > > The refcount of the dst was changed only when a dst_hold or dst_put operation was logged. The hold in dst_alloc() is balanced by the dst_release() in rt6_remove_exception(). The three holds in ip6_route_output_flags() belonged to the references held by sockets which were balanced by the dst_release() in inet_sock_destruct(). But that leaves one hold operation outstanding and unbalanced. There is a put operation for expired cached dst objects in ip6_negative_advice() but that one is intentionally balanced by a hold in the same function. > This makes sense to me. The code prior to 92f1655aa2b22 ("net: fix __dst_negative_advice() race") had no longer been able to release the cached dst for the reference held by socket in ip6_negative_advice() since rt6_remove_exception_rt() is called there. Hi, David Ahern, Can you confirm this? Thanks. > Accoding to the sk_dst trace entries, three sockets held a reference to the leaked dst. All of these sockets were eventually distroyed so there was no socket leak causing the dst leak, which rules out one of the possbile hypotheses. The dst was stored in the socket object while a connect() sycall was being executed for two of the sockets. An IMCPv6 packet carrying a Packet Too Big message was received by the third socket, which lead to the execution of routines changing the MTU for the dst and storing a reference to the dst in the socket object. As for races, sockets are locked in inet_stream_connect, tcp_write_timer() and tcp_v6_err(), and the mutual exclusion prevents races when reference to dst objects are set and reset for sockets. The put operation executed from dst_release() in inet_sock_destruct() is not part of a critical section protected by mutual exclusion but it is guaranteed to run only after the refcount of the socket has reached zero. All the other code paths in the trace increment the refcount of the socket while they manipulate dsts (the connect() syscall uses the refcount incremented by the socket() syscall). > > This is a trace for the socket for which ip6_negative_advice() was executed for an expired cached dst: > > socket ff1c4158c07aaf80 > > alloc: 1 destroy: 1 obj diff: 0 > > hold ops: 9 put ops: 9 refcnt diff: 0 > > ip6_negative_advice+0xcd/0x2d0: tcp_retransmit_timer+0x5ed/0xac0: sk_dst ff1c4151bcac1080 0 ff1c4158c07aaf80 1 > > tcp_retransmit_timer+0x50d/0xac0: tcp_write_timer_handler+0xba/0x1f0: sk_dst ff1c4151bcac1080 0 ff1c4158c07aaf80 2 > > ip6_negative_advice+0x169/0x2d0: tcp_retransmit_timer+0x5ed/0xac0: sk_dst ff1c4151bcac1080 0 ff1c4158c07aaf80 3 > > sk_setup_caps+0x5/0x200: inet6_sk_rebuild_header+0x1f9/0x2a0: sk_dst ff1c4151bcac1080 ff1c4153a762d500 ff1c4158c07aaf80 5 > > sk_setup_caps+0x5/0x200: tcp_v6_connect+0x428/0x790: sk_dst ff1c4151bcac1080 ff1c4157413bf900 ff1c4158c07aaf80 1 > > sk_free+0x5/0x100: tcp_close+0x100/0x120: sk ff1c4151bcac1080 0 ff1c4158c07aaf80 1 > > __sk_free+0x5/0x190: sk_free+0xe3/0x100: sk ff1c4151bcac1080 0 ff1c4158c07aaf80 1 > > sk_destruct+0x5/0x150: __sk_free+0x83/0x190: sk ff1c4151bcac1080 0 ff1c4158c07aaf80 1 > > sk_alloc+0x195/0x2a0: inet6_create+0xd5/0x450: sk ff1c4151bcac1080 0 ff1c4158c07aaf80 1 > > sk_free+0x5/0x100: ip6_rcv_core.isra.25+0x27c/0x440: sk ff1c4151bcac1080 0 ff1c4158c07aaf80 7 > > sk_reset_timer+0x7b/0x130: tcp_connect+0x83f/0xe00: sk_hold ff1c4151bcac1080 0 ff1c4158c07aaf80 1 > > __tcp_close+0x15d/0x4c0: tcp_close+0x35/0x120: sk_hold ff1c4151bcac1080 0 ff1c4158c07aaf80 1 > > sock_init_data_uid+0x107/0x2d0: inet6_create+0xec/0x450: sk_hold ff1c4151bcac1080 0 ff1c4158c07aaf80 1 > > sk_reset_timer+0x7b/0x130: tcp_retransmit_timer+0x4e3/0xac0: sk_hold ff1c4151bcac1080 0 ff1c4158c07aaf80 6 > > tcp_close+0x5/0x120: inet_release+0x3c/0x80: sk_put ff1c4151bcac1080 0 ff1c4158c07aaf80 1 > > inet_csk_destroy_sock+0x90/0x1e0: __tcp_close+0x2ca/0x4c0: sk_put ff1c4151bcac1080 0 ff1c4158c07aaf80 1 > > tcp_write_timer+0x61/0x260: call_timer_fn+0x27/0x130: sk_put ff1c4151bcac1080 0 ff1c4158c07aaf80 7 > > Besides dst ff1c4157413bf900, this sockets held a reference to another dst - ff1c4153a762d500. Regardless of the order in which the dst references were held, the extra dst_hold() in ip6_negative_advice() would preserve the refcount increment owned by the socket while also resetting the sk_dst_cache member of the socket to NULL, rendering it impossible for the socket to decrement the refcount in the future. We see that the socket was destroyed but the sk_dst_cache member was most probably NULL at the time because all the dst references were reset in tcp_retransmit_timer() and ip6_negative_advice(). > > Based on this data, I went back to recheck the kprobe module that was meant to implement a dst_release() in ip6_negative_advice(). I found a mistake in the module that resulted in executing: > > static void ip6_negative_advice(struct sock *sk, struct dst_entry *dst) > > { > > struct rt6_info *rt = (struct rt6_info *) dst; > > if (rt->rt6i_flags & RTF_CACHE) { > > rcu_read_lock(); > > if (rt6_check_expired(rt)) { > > /* counteract the dst_release() in sk_dst_reset() */ > > dst_hold(dst); > > sk_dst_reset(sk); > > rt6_remove_exception_rt(rt); > > } > > ---> kprobe_calls_dst_release; > > rcu_read_unlock(); > > return; > > } > > sk_dst_reset(sk); > > } > instead of this: > > static void ip6_negative_advice(struct sock *sk, struct dst_entry *dst) > > { > > struct rt6_info *rt = (struct rt6_info *) dst; > > if (rt->rt6i_flags & RTF_CACHE) { > > rcu_read_lock(); > > if (rt6_check_expired(rt)) { > > /* counteract the dst_release() in sk_dst_reset() */ > > dst_hold(dst); > > ---> kprobe_calls_dst_release; > > sk_dst_reset(sk); > > rt6_remove_exception_rt(rt); > > } > > rcu_read_unlock(); > > return; > > } > > sk_dst_reset(sk); > > } > which clearly shows why the test with the module resulted in negative overflow of dst refcounts. I fixed the module and further testing showed the issue is no longer reproducible. > > I have managed to put together a minimal set of steps needed to reproduce the issue: > ip link add veth1 mtu 65535 type veth peer veth0 mtu 65535 > ip netns add ns0 > ip link set veth1 netns ns0 > ip addr add fd00::1/24 dev veth0 > ip -n ns0 addr add fd00::2/24 dev veth1 > ip link set up dev veth0 > ip -n ns0 link set up dev lo > ip -n ns0 link set up dev veth1 > ip -n ns0 route add default via fd00::1 dev veth1 > > ip link add veth3 mtu 65535 type veth peer veth2 mtu 65535 > ip netns add ns2 > ip link set veth3 netns ns2 > ip addr add fd02::1/24 dev veth2 > ip -n ns2 addr add fd02::2/24 dev veth3 > ip link set up dev veth2 > ip -n ns2 link set up dev lo > ip -n ns2 link set up dev veth3 > ip -n ns2 route add default via fd02::1 dev veth3 > > ip netns exec ns0 bash -c "echo 6 > /proc/sys/net/ipv6/route/mtu_expires" > #ip netns exec ns2 bash -c "echo 6 > /proc/sys/net/ipv6/route/mtu_expires" > ip netns exec ns0 bash -c "echo 900 > /proc/sys/net/ipv6/route/gc_interval" > #ip netns exec ns2 bash -c "echo 900 > /proc/sys/net/ipv6/route/gc_interval" > sleep 30 > > ip6tables -F > ip6tables -A FORWARD -i veth0 -d fd02::/24 -j ACCEPT > ip6tables -A FORWARD -i veth2 -d fd00::/24 -j ACCEPT > echo 1 > /proc/sys/net/ipv6/conf/all/forwarding > ip6tables -L -v > > (ip netns exec ns2 netcat -6 -l -s fd02::2 -p 1234 &> /dev/null) & serv=$! > sleep 1 > dd if=/dev/zero bs=1M count=100 | ip netns exec ns0 netcat -6 fd02::2 1234 & clnt=$! > sleep 1 > kill $clnt $serv > wait $serv > > (ip netns exec ns2 netcat -6 -l -s fd02::2 -p 1234 &> /dev/null) & serv=$! > sleep 1 > dd if=/dev/zero bs=1M | ip netns exec ns0 netcat -6 fd02::2 1234 & clnt=$! > sleep 1 > ip link set veth2 mtu 2000 > sleep 1 > ip6tables -D FORWARD -i veth2 -d fd00::/24 -j ACCEPT > ip6tables -A FORWARD -i veth2 -d fd00::/24 -j DROP > > sleep 10 > kill $clnt $serv > wait $serv > > ip -n ns0 link set down dev lo > ip -n ns0 link set down dev veth1 > ip -n ns0 link delete dev veth1 > ip netns delete ns0 > > ip -n ns2 link set down dev lo > ip -n ns2 link set down dev veth3 > ip -n ns2 link delete dev veth3 > ip netns delete ns2 > > I was able to reproduce the dst leak under 6.12-rc7. I will submit a fix shortly. > -- > Jiri Wiesner > SUSE Labs
On Tue, Nov 12, 2024 at 04:51:27PM -0500, Xin Long wrote: > On Tue, Nov 12, 2024 at 3:57 AM Jiri Wiesner <jwiesner@suse.de> wrote: > > > > On Sun, Oct 06, 2024 at 02:25:25PM -0400, Xin Long wrote: > > > We recently also encountered this > > > > > > 'unregister_netdevice: waiting for lo to become free. Usage count = X' > > > > > > problem on our customer env after backporting > > > > > > Commit 92f1655aa2b22 ("net: fix __dst_negative_advice() race"). [1] > > > > > > The commit looks correct to me, so I guess it may uncover some existing > > > issues. > > > > > > As it took a very long time to get reproduced on our customer env, which > > > made it impossible to debug. Also the issue existed even after > > > disabling IPv6. > > > > > > It seems much easier to reproduce it on your customer env. So I'm wondering > > > > > > - Was the testing on your customer env related to IPv6 ? > > > - Does the issue still exist after reverting the commit [1] ? > > > > The customer tried reproducing the issue with 92f1655aa2b22 ("net: fix __dst_negative_advice() race") reverted and the issue appeared again. My next step was capturing more points within the stacktraces when dst refcounts are changed (I do not have full stack traces - the trace entries I store in the hash contain only the instruction pointer and the parent function). This is the trace for the leaked dst object: > > > dst ff1c4157413bf900 > > > alloc: 1 destroy: obj diff: 1 > > > hold ops: 5 put ops: 4 refcnt diff: 1 > > > Function Parent Op Net Device Dst Number of Calls > > > inet6_csk_route_socket+0x1c2/0x2d0: inet6_csk_update_pmtu+0x58/0x90: dst ff1c4151bcac1080 eth0 ff1c4157413bf900 1 > > > __ip6_rt_update_pmtu+0x183/0x3c0: inet6_csk_update_pmtu+0x4b/0x90: dst ff1c4151bcac1080 eth0 ff1c4157413bf900 1 > > > ip6_dst_lookup_flow+0x4f/0x1d0: inet6_csk_route_socket+0x198/0x2d0: dst ff1c4151bcac1080 eth0 ff1c4157413bf900 1 > > > ip6_dst_ifdown+0x5/0x250: dst_dev_put+0x2d/0xd0: dst ff1c4151bcac1080 eth0 ff1c4157413bf900 1 > > > tcp_v6_connect+0x37e/0x790: __inet_stream_connect+0x2c8/0x3a0: dst ff1c4151bcac1080 eth0 ff1c4157413bf900 2 > > > ip6_dst_lookup_flow+0x4f/0x1d0: tcp_v6_connect+0x320/0x790: dst ff1c4151bcac1080 eth0 ff1c4157413bf900 2 > > > __dev_queue_xmit+0x1af/0xd20: ip6_finish_output2+0x1f1/0x6e0: dst ff1c4151bcac1080 eth0 ff1c4157413bf900 675 > > > inet6_csk_xmit+0xa1/0x150: __tcp_transmit_skb+0x5f8/0xd40: dst ff1c4151bcac1080 eth0 ff1c4157413bf900 675 > > > Function Parent Op Net Device Dst Refcount Diff > > > ip6_route_output_flags+0x76/0x230: ip6_dst_lookup_tail+0x215/0x250: dst_hold ff1c4151bcac1080 eth0 ff1c4157413bf900 1 > > > ip6_negative_advice+0x66/0x2d0: tcp_retransmit_timer+0x5ed/0xac0: dst_hold ff1c4151bcac1080 eth0 ff1c4157413bf900 1 > > > dst_alloc+0x5e/0x180: ip6_dst_alloc+0x27/0x60: dst_hold ff1c4151bcac1080 eth0 ff1c4157413bf900 1 > > > ip6_route_output_flags+0x76/0x230: ip6_dst_lookup_tail+0x10f/0x250: dst_hold ff1c4151bcac1080 eth0 ff1c4157413bf900 2 > > > dst_release+0x32/0x140: ip6_negative_advice+0x137/0x2d0: dst_put ff1c4151bcac1080 eth0 ff1c4157413bf900 1 > > > dst_release+0x32/0x140: inet_sock_destruct+0x146/0x1c0: dst_put ff1c4151bcac1080 eth0 ff1c4157413bf900 2 > > > dst_release+0x32/0x140: rt6_remove_exception.part.53+0x7f/0xe0: dst_put ffffffff90ee2a80 blackhole_dev ff1c4157413bf900 1 > > > Function Parent Op Net Dst Socket Number of Calls > > > sk_setup_caps+0x5/0x200: tcp_v6_connect+0x428/0x790: sk_dst ff1c4151bcac1080 ff1c4157413bf900 ff1c4158c07aaf80 1 > > > sk_setup_caps+0x5/0x200: tcp_v6_connect+0x428/0x790: sk_dst ff1c4151bcac1080 ff1c4157413bf900 ff1c418e49bac280 1 > > > sk_setup_caps+0x5/0x200: inet6_csk_route_socket+0x23c/0x2d0: sk_dst ff1c4151bcac1080 ff1c4157413bf900 ff1c418ea04f7200 1 > > > > The refcount of the dst was changed only when a dst_hold or dst_put operation was logged. The hold in dst_alloc() is balanced by the dst_release() in rt6_remove_exception(). The three holds in ip6_route_output_flags() belonged to the references held by sockets which were balanced by the dst_release() in inet_sock_destruct(). But that leaves one hold operation outstanding and unbalanced. There is a put operation for expired cached dst objects in ip6_negative_advice() but that one is intentionally balanced by a hold in the same function. > > > This makes sense to me. > > The code prior to 92f1655aa2b22 ("net: fix __dst_negative_advice() > race") had no longer been able to release the cached dst for the > reference held by socket in ip6_negative_advice() since > rt6_remove_exception_rt() is called there. > > Hi, David Ahern, > > Can you confirm this? I have submitted "[PATCH net] net/ipv6: release expired exception dst cached in socket" which should have clearer explanations not burdened by the debugging process. I initially debugged the issue on a customer's server without having any idea how to reproduce it. I then reconstructed the steps to reproduce the issue from the tracing data, which allowed me to check for the issue in the latest upstream kernel.
On 11/12/24 2:51 PM, Xin Long wrote: > This makes sense to me. > > The code prior to 92f1655aa2b22 ("net: fix __dst_negative_advice() > race") had no longer been able to release the cached dst for the > reference held by socket in ip6_negative_advice() since > rt6_remove_exception_rt() is called there. > > Hi, David Ahern, > > Can you confirm this? > > Thanks. I am traveling this week, so even less time than usual for patches. I will need to find to refresh my memory on the history of the code.
diff --git a/net/ipv6/route.c b/net/ipv6/route.c index b4251915585f..b70267c8d251 100644 --- a/net/ipv6/route.c +++ b/net/ipv6/route.c @@ -2780,10 +2780,7 @@ static void ip6_negative_advice(struct sock *sk, if (rt->rt6i_flags & RTF_CACHE) { rcu_read_lock(); if (rt6_check_expired(rt)) { - /* counteract the dst_release() in sk_dst_reset() */ - dst_hold(dst); sk_dst_reset(sk); - rt6_remove_exception_rt(rt); } rcu_read_unlock();
An unbalanced refcount was reported for the loopback device of a net namespace being destroyed: unregister_netdevice: waiting for lo to become free. Usage count = 2 Analysis revealed that the IPv6 net device corresponding to the loopback device did not get destroyed (in6_dev_finish_destroy() was not called). The IPv6 loopback device had an unbalaced refcount: net dev 73da100 lo refcount 1 Operation Count Balancing Operation Count hold __ipv6_dev_mc_inc 2 ma_put 2 addrconf_ifdown 1 0 unbalanced hold fib6_nh_init 2 fib6_nh_init 2 put inet6_ifa_finish_destroy 1 ipv6_add_addr 1 ip6_dst_destroy 90 0 unbalanced ip6_dst_ifdown 90 0 unbalanced hold ip6_route_dev_notify 6 ip6_route_dev_notify 6 hold ipv6_add_addr 1 inet6_ifa_finish_destroy 1 put ma_put 2 __ipv6_dev_mc_inc 2 hold ndisc_netdev_event 2 ndisc_netdev_event 2 rt6_disable_ip 1 0 unbalanced The refcount of addrconf_ifdown() balances the refcount increment in ipv6_add_dev(), which had no corresponding trace entry. The rt6_disable_ip() and ip6_dst_ifdown() entries were hold operations on the looback device, and the ip6_dst_destroy() entries were put operations. One refcount decrement in ip6_dst_destroy() was not executed. At this point, a hash was implemented in the debug kernel to hold the changes of the refcount of dst objects per namespace. The trace for the dst object that did not decrement the IPv6 refcount of loopback follows: Function Parent Op Net Device Dst Refcount Diff ip6_dst_ifdown: dst_dev_put: dst ff404b2f073da100 eth0 ff404af71ffc9c00 1 ip6_negative_advice: tcp_retransmit_timer: dst_hold ff404b2f073da100 eth0 ff404af71ffc9c00 1 dst_alloc: ip6_dst_alloc: dst_hold ff404b2f073da100 eth0 ff404af71ffc9c00 1 ip6_route_output_flags: ip6_dst_lookup_tail: dst_hold ff404b2f073da100 eth0 ff404af71ffc9c00 84 dst_release: ip6_negative_advice: dst_put ff404b2f073da100 eth0 ff404af71ffc9c00 1 dst_release: tcp_retransmit_timer: dst_put ff404b2f073da100 eth0 ff404af71ffc9c00 20 dst_release: inet_sock_destruct: dst_put ff404b2f073da100 eth0 ff404af71ffc9c00 29 dst_release: __dev_queue_xmit: dst_put ff404b2f073da100 eth0 ff404af71ffc9c00 34 dst_release: rt6_remove_exception: dst_put ffffffff9c8e2a80 blackhole_dev ff404af71ffc9c00 1 The ip6_dst_ifdown() trace entry was neither a hold nor a put - it merely indicates that the net device of the dst object was changed to blackhole_dev and the IPv6 refcount was transferred onto the loopback device. There was no ip6_dst_destroy() trace entry, which means the dst object was not destroyed. There were 86 hold operations but only 85 put operations so the dst object was not destroyed because the refcount of the dst object was unbalanced. The problem is that the refcount sums are ambiguous. The most probable explanation is this: The dst object was a route for an IPv6 TCP connection that kept timing out. Sometimes, the process closed the socket, which corresponds to the refcount decrements of the dst_release()/inet_sock_destruct() entries. Sometimes, the TCP retransmit timer reset the dst of the sockets, which corresponds to the dst_release()/tcp_retransmit_timer() entries. I am unsure about the dst_release()/__dev_queue_xmit() entries because inet6_csk_xmit() sets skb->_skb_refdst with SKB_DST_NOREF. The feature that sets the above trace apart from all the other dst traces is the execution of ip6_negative_advice() for a cached and also expired dst object in the exception table. The cached and expired dst object has its refcount set to at least 2 before executing rt6_remove_exception_rt() found in ip6_negative_advice(). One decrement happens in rt6_remove_exception() after the dst object has been removed from the exception table. The other decrement happens in sk_dst_reset() but that one is counteracted by a dst_hold() intentionally placed just before the sk_dst_reset() in ip6_negative_advice(). The probem is that a socket that keeps a reference to a dst in its sk_dst_cache member increments the refcount of the dst by 1. This is apparent in the following code paths: * When ip6_route_output_flags() finds a dst that is then stored in sk->sk_dst_cache by ip6_dst_store() called from inet6_csk_route_socket(). * When inet_sock_destruct() calls dst_release() for sk->sk_dst_cache Provided the dst is not kept in the sk_dst_cache member of another socket, there is no other object tied to the dst (the socket lost its reference and the dst is no longer in the exception table) and the dst becomes a leaked object after ip6_negative_advice() finishes. This leak then precludes the net namespace from being destroyed. The patch that introduced the dst_hold() in ip6_negative_advice() was 92f1655aa2b22 ("net: fix __dst_negative_advice() race"). But 92f1655aa2b22 only refactored the code with regards to the dst refcount so the issue was present even before 92f1655aa2b22. Signed-off-by: Jiri Wiesner <jwiesner@suse.de> --- At the moment, I am sending this as an RFC because I am not able to reproduce the issue in-house. The customer that encountered the issue is currently running tests. For the customer's testing, I fixed the issue with a kprobe module that calls dst_release() right after rt6_remove_exception_rt() returns in ip6_negative_advice(), which is not quite the same as the change proposed below. net/ipv6/route.c | 3 --- 1 file changed, 3 deletions(-)