diff mbox series

[RFC] ipv6: route: release reference of dsts cached in sockets

Message ID 20240930180916.GA24637@incl (mailing list archive)
State RFC
Delegated to: Netdev Maintainers
Headers show
Series [RFC] ipv6: route: release reference of dsts cached in sockets | expand

Checks

Context Check Description
netdev/series_format warning Single patches do not need cover letters; Target tree name not specified in the subject
netdev/tree_selection success Guessed tree name to be net-next
netdev/ynl success Generated files up to date; no warnings/errors; no diff in generated;
netdev/fixes_present success Fixes tag not required for -next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 9 this patch: 9
netdev/build_tools success No tools touched, skip
netdev/cc_maintainers success CCed 5 of 5 maintainers
netdev/build_clang success Errors and warnings before: 9 this patch: 9
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn fail Errors and warnings before: 12 this patch: 12
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 10 lines checked
netdev/build_clang_rust success No Rust files in patch. Skipping build
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0

Commit Message

Jiri Wiesner Sept. 30, 2024, 6:09 p.m. UTC
An unbalanced refcount was reported for the loopback device of a net
namespace being destroyed:
unregister_netdevice: waiting for lo to become free. Usage count = 2

Analysis revealed that the IPv6 net device corresponding to the loopback
device did not get destroyed (in6_dev_finish_destroy() was not called).
The IPv6 loopback device had an unbalaced refcount:
net dev 73da100 lo refcount 1
Operation                     Count Balancing Operation     Count
hold  __ipv6_dev_mc_inc           2 ma_put                      2
      addrconf_ifdown             1                             0 unbalanced
hold  fib6_nh_init                2 fib6_nh_init                2
put   inet6_ifa_finish_destroy    1 ipv6_add_addr               1
      ip6_dst_destroy            90                             0 unbalanced
      ip6_dst_ifdown             90                             0 unbalanced
hold  ip6_route_dev_notify        6 ip6_route_dev_notify        6
hold  ipv6_add_addr               1 inet6_ifa_finish_destroy    1
put   ma_put                      2 __ipv6_dev_mc_inc           2
hold  ndisc_netdev_event          2 ndisc_netdev_event          2
      rt6_disable_ip              1                             0 unbalanced

The refcount of addrconf_ifdown() balances the refcount increment in
ipv6_add_dev(), which had no corresponding trace entry. The
rt6_disable_ip() and ip6_dst_ifdown() entries were hold operations on the
looback device, and the ip6_dst_destroy() entries were put operations. One
refcount decrement in ip6_dst_destroy() was not executed. At this point, a
hash was implemented in the debug kernel to hold the changes of the
refcount of dst objects per namespace. The trace for the dst object that
did not decrement the IPv6 refcount of loopback follows:

Function        Parent       Op  Net            Device Dst              Refcount Diff
ip6_dst_ifdown: dst_dev_put: dst ff404b2f073da100 eth0 ff404af71ffc9c00 1
ip6_negative_advice: tcp_retransmit_timer: dst_hold ff404b2f073da100 eth0 ff404af71ffc9c00 1
dst_alloc: ip6_dst_alloc: dst_hold ff404b2f073da100 eth0 ff404af71ffc9c00 1
ip6_route_output_flags: ip6_dst_lookup_tail: dst_hold ff404b2f073da100 eth0 ff404af71ffc9c00 84
dst_release: ip6_negative_advice: dst_put ff404b2f073da100 eth0 ff404af71ffc9c00 1
dst_release: tcp_retransmit_timer: dst_put ff404b2f073da100 eth0 ff404af71ffc9c00 20
dst_release: inet_sock_destruct: dst_put ff404b2f073da100 eth0 ff404af71ffc9c00 29
dst_release: __dev_queue_xmit: dst_put ff404b2f073da100 eth0 ff404af71ffc9c00 34
dst_release: rt6_remove_exception: dst_put ffffffff9c8e2a80 blackhole_dev ff404af71ffc9c00 1

The ip6_dst_ifdown() trace entry was neither a hold nor a put - it merely
indicates that the net device of the dst object was changed to
blackhole_dev and the IPv6 refcount was transferred onto the loopback
device. There was no ip6_dst_destroy() trace entry, which means the dst
object was not destroyed. There were 86 hold operations but only 85 put
operations so the dst object was not destroyed because the refcount of the
dst object was unbalanced.

The problem is that the refcount sums are ambiguous. The most probable
explanation is this: The dst object was a route for an IPv6 TCP connection
that kept timing out. Sometimes, the process closed the socket, which
corresponds to the refcount decrements of the
dst_release()/inet_sock_destruct() entries. Sometimes, the TCP retransmit
timer reset the dst of the sockets, which corresponds to the
dst_release()/tcp_retransmit_timer() entries. I am unsure about the
dst_release()/__dev_queue_xmit() entries because inet6_csk_xmit() sets
skb->_skb_refdst with SKB_DST_NOREF.

The feature that sets the above trace apart from all the other dst traces
is the execution of ip6_negative_advice() for a cached and also expired
dst object in the exception table. The cached and expired dst object has
its refcount set to at least 2 before executing rt6_remove_exception_rt()
found in ip6_negative_advice(). One decrement happens in
rt6_remove_exception() after the dst object has been removed from the
exception table. The other decrement happens in sk_dst_reset() but that
one is counteracted by a dst_hold() intentionally placed just before the
sk_dst_reset() in ip6_negative_advice(). The probem is that a socket that
keeps a reference to a dst in its sk_dst_cache member increments the
refcount of the dst by 1. This is apparent in the following code paths:
* When ip6_route_output_flags() finds a dst that is then stored in
  sk->sk_dst_cache by ip6_dst_store() called from inet6_csk_route_socket().
* When inet_sock_destruct() calls dst_release() for sk->sk_dst_cache
Provided the dst is not kept in the sk_dst_cache member of another socket,
there is no other object tied to the dst (the socket lost its reference
and the dst is no longer in the exception table) and the dst becomes a
leaked object after ip6_negative_advice() finishes. This leak then
precludes the net namespace from being destroyed.

The patch that introduced the dst_hold() in ip6_negative_advice() was
92f1655aa2b22 ("net: fix __dst_negative_advice() race"). But 92f1655aa2b22
only refactored the code with regards to the dst refcount so the issue was
present even before 92f1655aa2b22.

Signed-off-by: Jiri Wiesner <jwiesner@suse.de>
---
At the moment, I am sending this as an RFC because I am not able to 
reproduce the issue in-house. The customer that encountered the issue is 
currently running tests. For the customer's testing, I fixed the issue 
with a kprobe module that calls dst_release() right after 
rt6_remove_exception_rt() returns in ip6_negative_advice(), which is not 
quite the same as the change proposed below.

 net/ipv6/route.c | 3 ---
 1 file changed, 3 deletions(-)

Comments

Eric Dumazet Sept. 30, 2024, 6:27 p.m. UTC | #1
On Mon, Sep 30, 2024 at 8:09 PM Jiri Wiesner <jwiesner@suse.de> wrote:
>
> An unbalanced refcount was reported for the loopback device of a net
> namespace being destroyed:
> unregister_netdevice: waiting for lo to become free. Usage count = 2
>
> Analysis revealed that the IPv6 net device corresponding to the loopback
> device did not get destroyed (in6_dev_finish_destroy() was not called).
> The IPv6 loopback device had an unbalaced refcount:
> net dev 73da100 lo refcount 1
> Operation                     Count Balancing Operation     Count
> hold  __ipv6_dev_mc_inc           2 ma_put                      2
>       addrconf_ifdown             1                             0 unbalanced
> hold  fib6_nh_init                2 fib6_nh_init                2
> put   inet6_ifa_finish_destroy    1 ipv6_add_addr               1
>       ip6_dst_destroy            90                             0 unbalanced
>       ip6_dst_ifdown             90                             0 unbalanced
> hold  ip6_route_dev_notify        6 ip6_route_dev_notify        6
> hold  ipv6_add_addr               1 inet6_ifa_finish_destroy    1
> put   ma_put                      2 __ipv6_dev_mc_inc           2
> hold  ndisc_netdev_event          2 ndisc_netdev_event          2
>       rt6_disable_ip              1                             0 unbalanced
>
> The refcount of addrconf_ifdown() balances the refcount increment in
> ipv6_add_dev(), which had no corresponding trace entry. The
> rt6_disable_ip() and ip6_dst_ifdown() entries were hold operations on the
> looback device, and the ip6_dst_destroy() entries were put operations. One
> refcount decrement in ip6_dst_destroy() was not executed. At this point, a
> hash was implemented in the debug kernel to hold the changes of the
> refcount of dst objects per namespace. The trace for the dst object that
> did not decrement the IPv6 refcount of loopback follows:
>
> Function        Parent       Op  Net            Device Dst              Refcount Diff
> ip6_dst_ifdown: dst_dev_put: dst ff404b2f073da100 eth0 ff404af71ffc9c00 1
> ip6_negative_advice: tcp_retransmit_timer: dst_hold ff404b2f073da100 eth0 ff404af71ffc9c00 1
> dst_alloc: ip6_dst_alloc: dst_hold ff404b2f073da100 eth0 ff404af71ffc9c00 1
> ip6_route_output_flags: ip6_dst_lookup_tail: dst_hold ff404b2f073da100 eth0 ff404af71ffc9c00 84
> dst_release: ip6_negative_advice: dst_put ff404b2f073da100 eth0 ff404af71ffc9c00 1
> dst_release: tcp_retransmit_timer: dst_put ff404b2f073da100 eth0 ff404af71ffc9c00 20
> dst_release: inet_sock_destruct: dst_put ff404b2f073da100 eth0 ff404af71ffc9c00 29
> dst_release: __dev_queue_xmit: dst_put ff404b2f073da100 eth0 ff404af71ffc9c00 34
> dst_release: rt6_remove_exception: dst_put ffffffff9c8e2a80 blackhole_dev ff404af71ffc9c00 1
>
> The ip6_dst_ifdown() trace entry was neither a hold nor a put - it merely
> indicates that the net device of the dst object was changed to
> blackhole_dev and the IPv6 refcount was transferred onto the loopback
> device. There was no ip6_dst_destroy() trace entry, which means the dst
> object was not destroyed. There were 86 hold operations but only 85 put
> operations so the dst object was not destroyed because the refcount of the
> dst object was unbalanced.
>
> The problem is that the refcount sums are ambiguous. The most probable
> explanation is this: The dst object was a route for an IPv6 TCP connection
> that kept timing out. Sometimes, the process closed the socket, which
> corresponds to the refcount decrements of the
> dst_release()/inet_sock_destruct() entries. Sometimes, the TCP retransmit
> timer reset the dst of the sockets, which corresponds to the
> dst_release()/tcp_retransmit_timer() entries. I am unsure about the
> dst_release()/__dev_queue_xmit() entries because inet6_csk_xmit() sets
> skb->_skb_refdst with SKB_DST_NOREF.
>
> The feature that sets the above trace apart from all the other dst traces
> is the execution of ip6_negative_advice() for a cached and also expired
> dst object in the exception table. The cached and expired dst object has
> its refcount set to at least 2 before executing rt6_remove_exception_rt()
> found in ip6_negative_advice(). One decrement happens in
> rt6_remove_exception() after the dst object has been removed from the
> exception table. The other decrement happens in sk_dst_reset() but that
> one is counteracted by a dst_hold() intentionally placed just before the
> sk_dst_reset() in ip6_negative_advice(). The probem is that a socket that
> keeps a reference to a dst in its sk_dst_cache member increments the
> refcount of the dst by 1. This is apparent in the following code paths:
> * When ip6_route_output_flags() finds a dst that is then stored in
>   sk->sk_dst_cache by ip6_dst_store() called from inet6_csk_route_socket().
> * When inet_sock_destruct() calls dst_release() for sk->sk_dst_cache
> Provided the dst is not kept in the sk_dst_cache member of another socket,
> there is no other object tied to the dst (the socket lost its reference
> and the dst is no longer in the exception table) and the dst becomes a
> leaked object after ip6_negative_advice() finishes. This leak then
> precludes the net namespace from being destroyed.
>
> The patch that introduced the dst_hold() in ip6_negative_advice() was
> 92f1655aa2b22 ("net: fix __dst_negative_advice() race"). But 92f1655aa2b22
> only refactored the code with regards to the dst refcount so the issue was
> present even before 92f1655aa2b22.
>
> Signed-off-by: Jiri Wiesner <jwiesner@suse.de>
> ---
> At the moment, I am sending this as an RFC because I am not able to
> reproduce the issue in-house. The customer that encountered the issue is
> currently running tests. For the customer's testing, I fixed the issue
> with a kprobe module that calls dst_release() right after
> rt6_remove_exception_rt() returns in ip6_negative_advice(), which is not
> quite the same as the change proposed below.
>
>  net/ipv6/route.c | 3 ---
>  1 file changed, 3 deletions(-)
>
> diff --git a/net/ipv6/route.c b/net/ipv6/route.c
> index b4251915585f..b70267c8d251 100644
> --- a/net/ipv6/route.c
> +++ b/net/ipv6/route.c
> @@ -2780,10 +2780,7 @@ static void ip6_negative_advice(struct sock *sk,
>         if (rt->rt6i_flags & RTF_CACHE) {
>                 rcu_read_lock();
>                 if (rt6_check_expired(rt)) {
> -                       /* counteract the dst_release() in sk_dst_reset() */
> -                       dst_hold(dst);
>                         sk_dst_reset(sk);
> -
>                         rt6_remove_exception_rt(rt);
>                 }
>                 rcu_read_unlock();
> --

Interesting, what kernel version is your customer using ?

I think that with recent kernels (after 5.18), we do not see issues
because of the use of blackhole_netdev
instead of loopback.
Jiri Wiesner Oct. 1, 2024, 3:26 p.m. UTC | #2
On Mon, Sep 30, 2024 at 08:27:50PM +0200, Eric Dumazet wrote:
> On Mon, Sep 30, 2024 at 8:09 PM Jiri Wiesner <jwiesner@suse.de> wrote:
> >
> > An unbalanced refcount was reported for the loopback device of a net
> > namespace being destroyed:
> > unregister_netdevice: waiting for lo to become free. Usage count = 2
> >
> > Analysis revealed that the IPv6 net device corresponding to the loopback
> > device did not get destroyed (in6_dev_finish_destroy() was not called).
> > The IPv6 loopback device had an unbalaced refcount:
> > net dev 73da100 lo refcount 1
> > Operation                     Count Balancing Operation     Count
> > hold  __ipv6_dev_mc_inc           2 ma_put                      2
> >       addrconf_ifdown             1                             0 unbalanced
> > hold  fib6_nh_init                2 fib6_nh_init                2
> > put   inet6_ifa_finish_destroy    1 ipv6_add_addr               1
> >       ip6_dst_destroy            90                             0 unbalanced
> >       ip6_dst_ifdown             90                             0 unbalanced
> > hold  ip6_route_dev_notify        6 ip6_route_dev_notify        6
> > hold  ipv6_add_addr               1 inet6_ifa_finish_destroy    1
> > put   ma_put                      2 __ipv6_dev_mc_inc           2
> > hold  ndisc_netdev_event          2 ndisc_netdev_event          2
> >       rt6_disable_ip              1                             0 unbalanced
> >
> > The refcount of addrconf_ifdown() balances the refcount increment in
> > ipv6_add_dev(), which had no corresponding trace entry. The
> > rt6_disable_ip() and ip6_dst_ifdown() entries were hold operations on the
> > looback device, and the ip6_dst_destroy() entries were put operations. One
> > refcount decrement in ip6_dst_destroy() was not executed. At this point, a
> > hash was implemented in the debug kernel to hold the changes of the
> > refcount of dst objects per namespace. The trace for the dst object that
> > did not decrement the IPv6 refcount of loopback follows:
> >
> > Function        Parent       Op  Net            Device Dst              Refcount Diff
> > ip6_dst_ifdown: dst_dev_put: dst ff404b2f073da100 eth0 ff404af71ffc9c00 1
> > ip6_negative_advice: tcp_retransmit_timer: dst_hold ff404b2f073da100 eth0 ff404af71ffc9c00 1
> > dst_alloc: ip6_dst_alloc: dst_hold ff404b2f073da100 eth0 ff404af71ffc9c00 1
> > ip6_route_output_flags: ip6_dst_lookup_tail: dst_hold ff404b2f073da100 eth0 ff404af71ffc9c00 84
> > dst_release: ip6_negative_advice: dst_put ff404b2f073da100 eth0 ff404af71ffc9c00 1
> > dst_release: tcp_retransmit_timer: dst_put ff404b2f073da100 eth0 ff404af71ffc9c00 20
> > dst_release: inet_sock_destruct: dst_put ff404b2f073da100 eth0 ff404af71ffc9c00 29
> > dst_release: __dev_queue_xmit: dst_put ff404b2f073da100 eth0 ff404af71ffc9c00 34
> > dst_release: rt6_remove_exception: dst_put ffffffff9c8e2a80 blackhole_dev ff404af71ffc9c00 1
> >
> > The ip6_dst_ifdown() trace entry was neither a hold nor a put - it merely
> > indicates that the net device of the dst object was changed to
> > blackhole_dev and the IPv6 refcount was transferred onto the loopback
> > device. There was no ip6_dst_destroy() trace entry, which means the dst
> > object was not destroyed. There were 86 hold operations but only 85 put
> > operations so the dst object was not destroyed because the refcount of the
> > dst object was unbalanced.
> >
> > The problem is that the refcount sums are ambiguous. The most probable
> > explanation is this: The dst object was a route for an IPv6 TCP connection
> > that kept timing out. Sometimes, the process closed the socket, which
> > corresponds to the refcount decrements of the
> > dst_release()/inet_sock_destruct() entries. Sometimes, the TCP retransmit
> > timer reset the dst of the sockets, which corresponds to the
> > dst_release()/tcp_retransmit_timer() entries. I am unsure about the
> > dst_release()/__dev_queue_xmit() entries because inet6_csk_xmit() sets
> > skb->_skb_refdst with SKB_DST_NOREF.
> >
> > The feature that sets the above trace apart from all the other dst traces
> > is the execution of ip6_negative_advice() for a cached and also expired
> > dst object in the exception table. The cached and expired dst object has
> > its refcount set to at least 2 before executing rt6_remove_exception_rt()
> > found in ip6_negative_advice(). One decrement happens in
> > rt6_remove_exception() after the dst object has been removed from the
> > exception table. The other decrement happens in sk_dst_reset() but that
> > one is counteracted by a dst_hold() intentionally placed just before the
> > sk_dst_reset() in ip6_negative_advice(). The probem is that a socket that
> > keeps a reference to a dst in its sk_dst_cache member increments the
> > refcount of the dst by 1. This is apparent in the following code paths:
> > * When ip6_route_output_flags() finds a dst that is then stored in
> >   sk->sk_dst_cache by ip6_dst_store() called from inet6_csk_route_socket().
> > * When inet_sock_destruct() calls dst_release() for sk->sk_dst_cache
> > Provided the dst is not kept in the sk_dst_cache member of another socket,
> > there is no other object tied to the dst (the socket lost its reference
> > and the dst is no longer in the exception table) and the dst becomes a
> > leaked object after ip6_negative_advice() finishes. This leak then
> > precludes the net namespace from being destroyed.
> >
> > The patch that introduced the dst_hold() in ip6_negative_advice() was
> > 92f1655aa2b22 ("net: fix __dst_negative_advice() race"). But 92f1655aa2b22
> > only refactored the code with regards to the dst refcount so the issue was
> > present even before 92f1655aa2b22.
> >
> > Signed-off-by: Jiri Wiesner <jwiesner@suse.de>
> > ---
> > At the moment, I am sending this as an RFC because I am not able to
> > reproduce the issue in-house. The customer that encountered the issue is
> > currently running tests. For the customer's testing, I fixed the issue
> > with a kprobe module that calls dst_release() right after
> > rt6_remove_exception_rt() returns in ip6_negative_advice(), which is not
> > quite the same as the change proposed below.
> >
> >  net/ipv6/route.c | 3 ---
> >  1 file changed, 3 deletions(-)
> >
> > diff --git a/net/ipv6/route.c b/net/ipv6/route.c
> > index b4251915585f..b70267c8d251 100644
> > --- a/net/ipv6/route.c
> > +++ b/net/ipv6/route.c
> > @@ -2780,10 +2780,7 @@ static void ip6_negative_advice(struct sock *sk,
> >         if (rt->rt6i_flags & RTF_CACHE) {
> >                 rcu_read_lock();
> >                 if (rt6_check_expired(rt)) {
> > -                       /* counteract the dst_release() in sk_dst_reset() */
> > -                       dst_hold(dst);
> >                         sk_dst_reset(sk);
> > -
> >                         rt6_remove_exception_rt(rt);
> >                 }
> >                 rcu_read_unlock();
> > --
> 
> Interesting, what kernel version is your customer using ?

It is a distribution kernel (SLES 15 SP5) that is based on 5.14. The newest SLES service pack, SP6, has a 6.4-based kernel. But the customer has yet to move to that one.

> I think that with recent kernels (after 5.18), we do not see issues
> because of the use of blackhole_netdev
> instead of loopback.

Thank you very much for mentioning that. I will be backporting these patches to SUSE's 5.14-based kernels:
> 9cc341286e99 ("dn_route: set rt neigh to blackhole_netdev instead of loopback_dev in ifdown")
> 4d33ab08c0af ("xfrm: set dst dev to blackhole_netdev instead of loopback_dev in ifdown")
> dd263a8cb194 ("ipv6: blackhole_netdev needs snmp6 counters")
> e5f80fcf869a ("ipv6: give an IPv6 dev to blackhole_netdev")
These patches will not resolve the dst leak but being able to destroy net namespaces at the expense of sustaining infrequently occurring dst leaks is a favourable trade.

The testing results (debugging SLES15 SP5 kernel plus the kprobe module calling dst_release() in probe_ip6_negative_advice()) came back and the kernel log indicates negative overflow of the dst refcount. The first machine:
> [ 9366.700330] dst_release underflow
> [ 9366.700346] WARNING: CPU: 42 PID: 0 at ../net/core/dst.c:176 dst_release+0x19f/0x1b0
> [ 9366.701492] Call Trace:
> [ 9366.701497]  <IRQ>
> [ 9366.701506]  probe_ip6_negative_advice+0x12/0x30 [probe 0c57153a505ed7cc20e2c5db9522a1a524ec4f56]
> [ 9366.701514]  opt_pre_handler+0x3d/0x70
> [ 9366.701528]  optimized_callback+0x64/0x90
> [ 9366.701538]  0xffffffffc0476032
> [ 9366.701548]  ? ip6_negative_advice+0xdf/0x180
> [ 9366.701558]  ? tcp_retransmit_timer+0x658/0x930
> [ 9366.701595]  ? tcp_write_timer_handler+0xba/0x1f0
> [ 9366.701601]  ? tcp_write_timer+0x211/0x260
> [ 9366.701624]  ? call_timer_fn+0x27/0x130
> [ 9366.701631]  ? run_timer_softirq+0x443/0x480
> [ 9366.701645]  ? __do_softirq+0xd2/0x2c0
> [ 9366.701654]  ? irq_exit_rcu+0xa4/0xc0
> [ 9366.701661]  ? sysvec_apic_timer_interrupt+0x50/0x90
> [ 9366.701667]  </IRQ>
> [ 9366.701744] dst_release: dst:00000000ac3b9120 refcnt:-1
> [ 9366.701757] dst_release: dst:00000000ac3b9120 refcnt:-2
> [ 9382.832164] dst_release: dst:00000000ac3b9120 refcnt:-3
> [ 9659.561679] dst_release: dst:00000000494633c6 refcnt:-1
Obviously, the refcount of the cached and expired dst, dst:00000000ac3b9120, was 1 before probe_ip6_negative_advice() was called from tcp_write_timeout()/tcp_retransmit_timer(). So, my supposition that cached dsts being passed to probe_ip6_negative_advice() have a refcount of at least 2 was wrong.

The second machine:
[11479.272704] dst_release underflow
[11479.272718] WARNING: CPU: 52 PID: 0 at ../net/core/dst.c:176 dst_release+0x19f/0x1b0
[11479.272924] Call Trace:
[11479.272928]  <IRQ>
[11479.272932]  tcp_retransmit_timer+0x47c/0x930
[11479.272953]  tcp_write_timer_handler+0xba/0x1f0
[11479.272956]  tcp_write_timer+0x211/0x260
[11479.272962]  call_timer_fn+0x27/0x130
[11479.272966]  run_timer_softirq+0x443/0x480
[11479.272973]  __do_softirq+0xd2/0x2c0
[11479.272980]  irq_exit_rcu+0xa4/0xc0
[11479.272985]  sysvec_apic_timer_interrupt+0x50/0x90
[11479.272989]  </IRQ>
[11479.273044] dst_release: dst:00000000ff7e6845 refcnt:-1
[11681.380331] dst_release: dst:00000000893d8a2d refcnt:-1
In this case, the refcount of dst:00000000ff7e6845 was have been 2 before probe_ip6_negative_advice() so the dst_release() called from the kprobe handler decremented the refcount to 0. No warning was printed. There probably was another socket having the dst stored in sk_dst_cache, to which the refcount of 1 belonged. When the second __sk_dst_reset() was called in tcp_retransmit_timer(), which was the tcp_retransmit_timer+0x47c site, the refcount dropped below 0.

I am afraid this patch is misguided. I would still like to find the source of the dst leak but I am also running out of time which the customer is willing to invest into investigating this issue.
Xin Long Oct. 2, 2024, 8:27 p.m. UTC | #3
On Tue, Oct 1, 2024 at 11:26 AM Jiri Wiesner <jwiesner@suse.de> wrote:
>
> On Mon, Sep 30, 2024 at 08:27:50PM +0200, Eric Dumazet wrote:
> > On Mon, Sep 30, 2024 at 8:09 PM Jiri Wiesner <jwiesner@suse.de> wrote:
> > >
> > > An unbalanced refcount was reported for the loopback device of a net
> > > namespace being destroyed:
> > > unregister_netdevice: waiting for lo to become free. Usage count = 2
> > >
> > > Analysis revealed that the IPv6 net device corresponding to the loopback
> > > device did not get destroyed (in6_dev_finish_destroy() was not called).
> > > The IPv6 loopback device had an unbalaced refcount:
> > > net dev 73da100 lo refcount 1
> > > Operation                     Count Balancing Operation     Count
> > > hold  __ipv6_dev_mc_inc           2 ma_put                      2
> > >       addrconf_ifdown             1                             0 unbalanced
> > > hold  fib6_nh_init                2 fib6_nh_init                2
> > > put   inet6_ifa_finish_destroy    1 ipv6_add_addr               1
> > >       ip6_dst_destroy            90                             0 unbalanced
> > >       ip6_dst_ifdown             90                             0 unbalanced
> > > hold  ip6_route_dev_notify        6 ip6_route_dev_notify        6
> > > hold  ipv6_add_addr               1 inet6_ifa_finish_destroy    1
> > > put   ma_put                      2 __ipv6_dev_mc_inc           2
> > > hold  ndisc_netdev_event          2 ndisc_netdev_event          2
> > >       rt6_disable_ip              1                             0 unbalanced
> > >
> > > The refcount of addrconf_ifdown() balances the refcount increment in
> > > ipv6_add_dev(), which had no corresponding trace entry. The
> > > rt6_disable_ip() and ip6_dst_ifdown() entries were hold operations on the
> > > looback device, and the ip6_dst_destroy() entries were put operations. One
> > > refcount decrement in ip6_dst_destroy() was not executed. At this point, a
> > > hash was implemented in the debug kernel to hold the changes of the
> > > refcount of dst objects per namespace. The trace for the dst object that
> > > did not decrement the IPv6 refcount of loopback follows:
> > >
> > > Function        Parent       Op  Net            Device Dst              Refcount Diff
> > > ip6_dst_ifdown: dst_dev_put: dst ff404b2f073da100 eth0 ff404af71ffc9c00 1
> > > ip6_negative_advice: tcp_retransmit_timer: dst_hold ff404b2f073da100 eth0 ff404af71ffc9c00 1
> > > dst_alloc: ip6_dst_alloc: dst_hold ff404b2f073da100 eth0 ff404af71ffc9c00 1
> > > ip6_route_output_flags: ip6_dst_lookup_tail: dst_hold ff404b2f073da100 eth0 ff404af71ffc9c00 84
> > > dst_release: ip6_negative_advice: dst_put ff404b2f073da100 eth0 ff404af71ffc9c00 1
> > > dst_release: tcp_retransmit_timer: dst_put ff404b2f073da100 eth0 ff404af71ffc9c00 20
> > > dst_release: inet_sock_destruct: dst_put ff404b2f073da100 eth0 ff404af71ffc9c00 29
> > > dst_release: __dev_queue_xmit: dst_put ff404b2f073da100 eth0 ff404af71ffc9c00 34
> > > dst_release: rt6_remove_exception: dst_put ffffffff9c8e2a80 blackhole_dev ff404af71ffc9c00 1
> > >
> > > The ip6_dst_ifdown() trace entry was neither a hold nor a put - it merely
> > > indicates that the net device of the dst object was changed to
> > > blackhole_dev and the IPv6 refcount was transferred onto the loopback
> > > device. There was no ip6_dst_destroy() trace entry, which means the dst
> > > object was not destroyed. There were 86 hold operations but only 85 put
> > > operations so the dst object was not destroyed because the refcount of the
> > > dst object was unbalanced.
> > >
> > > The problem is that the refcount sums are ambiguous. The most probable
> > > explanation is this: The dst object was a route for an IPv6 TCP connection
> > > that kept timing out. Sometimes, the process closed the socket, which
> > > corresponds to the refcount decrements of the
> > > dst_release()/inet_sock_destruct() entries. Sometimes, the TCP retransmit
> > > timer reset the dst of the sockets, which corresponds to the
> > > dst_release()/tcp_retransmit_timer() entries. I am unsure about the
> > > dst_release()/__dev_queue_xmit() entries because inet6_csk_xmit() sets
> > > skb->_skb_refdst with SKB_DST_NOREF.
> > >
> > > The feature that sets the above trace apart from all the other dst traces
> > > is the execution of ip6_negative_advice() for a cached and also expired
> > > dst object in the exception table. The cached and expired dst object has
> > > its refcount set to at least 2 before executing rt6_remove_exception_rt()
> > > found in ip6_negative_advice(). One decrement happens in
> > > rt6_remove_exception() after the dst object has been removed from the
> > > exception table. The other decrement happens in sk_dst_reset() but that
> > > one is counteracted by a dst_hold() intentionally placed just before the
> > > sk_dst_reset() in ip6_negative_advice(). The probem is that a socket that
> > > keeps a reference to a dst in its sk_dst_cache member increments the
> > > refcount of the dst by 1. This is apparent in the following code paths:
> > > * When ip6_route_output_flags() finds a dst that is then stored in
> > >   sk->sk_dst_cache by ip6_dst_store() called from inet6_csk_route_socket().
> > > * When inet_sock_destruct() calls dst_release() for sk->sk_dst_cache
> > > Provided the dst is not kept in the sk_dst_cache member of another socket,
> > > there is no other object tied to the dst (the socket lost its reference
> > > and the dst is no longer in the exception table) and the dst becomes a
> > > leaked object after ip6_negative_advice() finishes. This leak then
> > > precludes the net namespace from being destroyed.
> > >
> > > The patch that introduced the dst_hold() in ip6_negative_advice() was
> > > 92f1655aa2b22 ("net: fix __dst_negative_advice() race"). But 92f1655aa2b22
> > > only refactored the code with regards to the dst refcount so the issue was
> > > present even before 92f1655aa2b22.
> > >
> > > Signed-off-by: Jiri Wiesner <jwiesner@suse.de>
> > > ---
> > > At the moment, I am sending this as an RFC because I am not able to
> > > reproduce the issue in-house. The customer that encountered the issue is
> > > currently running tests. For the customer's testing, I fixed the issue
> > > with a kprobe module that calls dst_release() right after
> > > rt6_remove_exception_rt() returns in ip6_negative_advice(), which is not
> > > quite the same as the change proposed below.
> > >
> > >  net/ipv6/route.c | 3 ---
> > >  1 file changed, 3 deletions(-)
> > >
> > > diff --git a/net/ipv6/route.c b/net/ipv6/route.c
> > > index b4251915585f..b70267c8d251 100644
> > > --- a/net/ipv6/route.c
> > > +++ b/net/ipv6/route.c
> > > @@ -2780,10 +2780,7 @@ static void ip6_negative_advice(struct sock *sk,
> > >         if (rt->rt6i_flags & RTF_CACHE) {
> > >                 rcu_read_lock();
> > >                 if (rt6_check_expired(rt)) {
> > > -                       /* counteract the dst_release() in sk_dst_reset() */
> > > -                       dst_hold(dst);
> > >                         sk_dst_reset(sk);
> > > -
> > >                         rt6_remove_exception_rt(rt);
> > >                 }
> > >                 rcu_read_unlock();
> > > --
> >
> > Interesting, what kernel version is your customer using ?
>
> It is a distribution kernel (SLES 15 SP5) that is based on 5.14. The newest SLES service pack, SP6, has a 6.4-based kernel. But the customer has yet to move to that one.
>
> > I think that with recent kernels (after 5.18), we do not see issues
> > because of the use of blackhole_netdev
> > instead of loopback.
>
> Thank you very much for mentioning that. I will be backporting these patches to SUSE's 5.14-based kernels:
> > 9cc341286e99 ("dn_route: set rt neigh to blackhole_netdev instead of loopback_dev in ifdown")
> > 4d33ab08c0af ("xfrm: set dst dev to blackhole_netdev instead of loopback_dev in ifdown")
> > dd263a8cb194 ("ipv6: blackhole_netdev needs snmp6 counters")
> > e5f80fcf869a ("ipv6: give an IPv6 dev to blackhole_netdev")
> These patches will not resolve the dst leak but being able to destroy net namespaces at the expense of sustaining infrequently occurring dst leaks is a favourable trade.
>
> The testing results (debugging SLES15 SP5 kernel plus the kprobe module calling dst_release() in probe_ip6_negative_advice()) came back and the kernel log indicates negative overflow of the dst refcount. The first machine:
> > [ 9366.700330] dst_release underflow
> > [ 9366.700346] WARNING: CPU: 42 PID: 0 at ../net/core/dst.c:176 dst_release+0x19f/0x1b0
> > [ 9366.701492] Call Trace:
> > [ 9366.701497]  <IRQ>
> > [ 9366.701506]  probe_ip6_negative_advice+0x12/0x30 [probe 0c57153a505ed7cc20e2c5db9522a1a524ec4f56]
> > [ 9366.701514]  opt_pre_handler+0x3d/0x70
> > [ 9366.701528]  optimized_callback+0x64/0x90
> > [ 9366.701538]  0xffffffffc0476032
> > [ 9366.701548]  ? ip6_negative_advice+0xdf/0x180
> > [ 9366.701558]  ? tcp_retransmit_timer+0x658/0x930
> > [ 9366.701595]  ? tcp_write_timer_handler+0xba/0x1f0
> > [ 9366.701601]  ? tcp_write_timer+0x211/0x260
> > [ 9366.701624]  ? call_timer_fn+0x27/0x130
> > [ 9366.701631]  ? run_timer_softirq+0x443/0x480
> > [ 9366.701645]  ? __do_softirq+0xd2/0x2c0
> > [ 9366.701654]  ? irq_exit_rcu+0xa4/0xc0
> > [ 9366.701661]  ? sysvec_apic_timer_interrupt+0x50/0x90
> > [ 9366.701667]  </IRQ>
> > [ 9366.701744] dst_release: dst:00000000ac3b9120 refcnt:-1
> > [ 9366.701757] dst_release: dst:00000000ac3b9120 refcnt:-2
> > [ 9382.832164] dst_release: dst:00000000ac3b9120 refcnt:-3
> > [ 9659.561679] dst_release: dst:00000000494633c6 refcnt:-1
> Obviously, the refcount of the cached and expired dst, dst:00000000ac3b9120, was 1 before probe_ip6_negative_advice() was called from tcp_write_timeout()/tcp_retransmit_timer(). So, my supposition that cached dsts being passed to probe_ip6_negative_advice() have a refcount of at least 2 was wrong.
>
> The second machine:
> [11479.272704] dst_release underflow
> [11479.272718] WARNING: CPU: 52 PID: 0 at ../net/core/dst.c:176 dst_release+0x19f/0x1b0
> [11479.272924] Call Trace:
> [11479.272928]  <IRQ>
> [11479.272932]  tcp_retransmit_timer+0x47c/0x930
> [11479.272953]  tcp_write_timer_handler+0xba/0x1f0
> [11479.272956]  tcp_write_timer+0x211/0x260
> [11479.272962]  call_timer_fn+0x27/0x130
> [11479.272966]  run_timer_softirq+0x443/0x480
> [11479.272973]  __do_softirq+0xd2/0x2c0
> [11479.272980]  irq_exit_rcu+0xa4/0xc0
> [11479.272985]  sysvec_apic_timer_interrupt+0x50/0x90
> [11479.272989]  </IRQ>
> [11479.273044] dst_release: dst:00000000ff7e6845 refcnt:-1
> [11681.380331] dst_release: dst:00000000893d8a2d refcnt:-1
> In this case, the refcount of dst:00000000ff7e6845 was have been 2 before probe_ip6_negative_advice() so the dst_release() called from the kprobe handler decremented the refcount to 0. No warning was printed. There probably was another socket having the dst stored in sk_dst_cache, to which the refcount of 1 belonged. When the second __sk_dst_reset() was called in tcp_retransmit_timer(), which was the tcp_retransmit_timer+0x47c site, the refcount dropped below 0.
>
> I am afraid this patch is misguided. I would still like to find the source of the dst leak but I am also running out of time which the customer is willing to invest into investigating this issue.
Is your kernel including this commit?

commit 28044fc1d4953b07acec0da4d2fc4784c57ea6fb
Author: Joanne Koong <joannelkoong@gmail.com>
Date:   Mon Aug 22 11:10:21 2022 -0700

    net: Add a bhash2 table hashed by port and address

After this commit, it seems in tcp_v6_connect(), the 'goto failure'
may cause a dst leak.:

        dst = ip6_dst_lookup_flow(net, sk, &fl6, final_p);
        ...
        if (!saddr) {
                saddr = &fl6.saddr;

                err = inet_bhash2_update_saddr(sk, saddr, AF_INET6);
                if (err)
                        goto failure; <---
        }
        ...
        ip6_dst_store(sk, dst, NULL, NULL);
Jiri Wiesner Oct. 3, 2024, 5:01 p.m. UTC | #4
On Wed, Oct 02, 2024 at 04:27:55PM -0400, Xin Long wrote:
> On Tue, Oct 1, 2024 at 11:26 AM Jiri Wiesner <jwiesner@suse.de> wrote:
> > I am afraid this patch is misguided. I would still like to find the source of the dst leak but I am also running out of time which the customer is willing to invest into investigating this issue.
> Is your kernel including this commit?
> 
> commit 28044fc1d4953b07acec0da4d2fc4784c57ea6fb
> Author: Joanne Koong <joannelkoong@gmail.com>
> Date:   Mon Aug 22 11:10:21 2022 -0700
> 
>     net: Add a bhash2 table hashed by port and address
> 
> After this commit, it seems in tcp_v6_connect(), the 'goto failure'
> may cause a dst leak.:
> 
>         dst = ip6_dst_lookup_flow(net, sk, &fl6, final_p);
>         ...
>         if (!saddr) {
>                 saddr = &fl6.saddr;
> 
>                 err = inet_bhash2_update_saddr(sk, saddr, AF_INET6);
>                 if (err)
>                         goto failure; <---
>         }
>         ...
>         ip6_dst_store(sk, dst, NULL, NULL);

Thanks for pointing this out. 28044fc1d495 seems to be an interesting commit as far as the number of Fixes is concerned. The commit was not backported to the 5.14-based SLES kernels, for which the unbalaced refcount bug was reported. The commit is part of the 6.4-based SLES kernels so I will have to see if all the patches with Fixes tags have been backported.
J.
Xin Long Oct. 6, 2024, 6:25 p.m. UTC | #5
On Thu, Oct 3, 2024 at 1:01 PM Jiri Wiesner <jwiesner@suse.de> wrote:
>
> On Wed, Oct 02, 2024 at 04:27:55PM -0400, Xin Long wrote:
> > On Tue, Oct 1, 2024 at 11:26 AM Jiri Wiesner <jwiesner@suse.de> wrote:
> > > I am afraid this patch is misguided. I would still like to find the source of the dst leak but I am also running out of time which the customer is willing to invest into investigating this issue.
> > Is your kernel including this commit?
> >
> > commit 28044fc1d4953b07acec0da4d2fc4784c57ea6fb
> > Author: Joanne Koong <joannelkoong@gmail.com>
> > Date:   Mon Aug 22 11:10:21 2022 -0700
> >
> >     net: Add a bhash2 table hashed by port and address
> >
> > After this commit, it seems in tcp_v6_connect(), the 'goto failure'
> > may cause a dst leak.:
> >
> >         dst = ip6_dst_lookup_flow(net, sk, &fl6, final_p);
> >         ...
> >         if (!saddr) {
> >                 saddr = &fl6.saddr;
> >
> >                 err = inet_bhash2_update_saddr(sk, saddr, AF_INET6);
> >                 if (err)
> >                         goto failure; <---
> >         }
> >         ...
> >         ip6_dst_store(sk, dst, NULL, NULL);
>
> Thanks for pointing this out. 28044fc1d495 seems to be an interesting commit as far as the number of Fixes is concerned. The commit was not backported to the 5.14-based SLES kernels, for which the unbalaced refcount bug was reported. The commit is part of the 6.4-based SLES kernels so I will have to see if all the patches with Fixes tags have been backported.
> J.
Hi, Jiri,

We recently also encountered this

  'unregister_netdevice: waiting for lo to become free. Usage count = X'

problem on our customer env after backporting

  Commit 92f1655aa2b22 ("net: fix __dst_negative_advice() race"). [1]

The commit looks correct to me, so I guess it may uncover some existing
issues.

As it took a very long time to get reproduced on our customer env, which
made it impossible to debug. Also the issue existed even after
disabling IPv6.

It seems much easier to reproduce it on your customer env. So I'm wondering

- Was the testing on your customer env related to IPv6 ?
- Does the issue still exist after reverting the commit [1] ?

Thanks.
Jiri Wiesner Oct. 9, 2024, 1:39 p.m. UTC | #6
On Sun, Oct 06, 2024 at 02:25:25PM -0400, Xin Long wrote:
> We recently also encountered this
> 
>   'unregister_netdevice: waiting for lo to become free. Usage count = X'
> 
> problem on our customer env after backporting
> 
>   Commit 92f1655aa2b22 ("net: fix __dst_negative_advice() race"). [1]
> 
> The commit looks correct to me, so I guess it may uncover some existing
> issues.

That is interesting. It seems possible to me.

> As it took a very long time to get reproduced on our customer env, which
> made it impossible to debug. Also the issue existed even after
> disabling IPv6.

AFAIK, the reproducibility of the issue hinges on net namespaces being created and destroyed often and TCP connections timing out in the namespaces. Traffic rate does not seem important.

> It seems much easier to reproduce it on your customer env. So I'm wondering

It takes roughly 24 hours but sometimes much more or much less to reproduce this.

> - Was the testing on your customer env related to IPv6 ?

Yes, it was. But I don't think the customer can disable IPv6. Their product is based on IPv6.

> - Does the issue still exist after reverting the commit [1] ?

I will ask the customer for a test.

Also, the issue is not reproducible after e5f80fcf869a ("ipv6: give an IPv6 dev to blackhole_netdev") was backported.
J.
Jiri Wiesner Nov. 12, 2024, 8:56 a.m. UTC | #7
On Sun, Oct 06, 2024 at 02:25:25PM -0400, Xin Long wrote:
> We recently also encountered this
>
>   'unregister_netdevice: waiting for lo to become free. Usage count = X'
>
> problem on our customer env after backporting
>
>   Commit 92f1655aa2b22 ("net: fix __dst_negative_advice() race"). [1]
>
> The commit looks correct to me, so I guess it may uncover some existing
> issues.
>
> As it took a very long time to get reproduced on our customer env, which
> made it impossible to debug. Also the issue existed even after
> disabling IPv6.
>
> It seems much easier to reproduce it on your customer env. So I'm wondering
>
> - Was the testing on your customer env related to IPv6 ?
> - Does the issue still exist after reverting the commit [1] ?

The customer tried reproducing the issue with 92f1655aa2b22 ("net: fix __dst_negative_advice() race") reverted and the issue appeared again. My next step was capturing more points within the stacktraces when dst refcounts are changed (I do not have full stack traces - the trace entries I store in the hash contain only the instruction pointer and the parent function). This is the trace for the leaked dst object:
> dst ff1c4157413bf900
> alloc: 1 destroy:  obj diff: 1
> hold ops: 5 put ops: 4 refcnt diff: 1
> Function                            Parent                           Op  Net            Device Dst              Number of Calls
> inet6_csk_route_socket+0x1c2/0x2d0: inet6_csk_update_pmtu+0x58/0x90: dst ff1c4151bcac1080 eth0 ff1c4157413bf900 1
> __ip6_rt_update_pmtu+0x183/0x3c0: inet6_csk_update_pmtu+0x4b/0x90: dst ff1c4151bcac1080 eth0 ff1c4157413bf900 1
> ip6_dst_lookup_flow+0x4f/0x1d0: inet6_csk_route_socket+0x198/0x2d0: dst ff1c4151bcac1080 eth0 ff1c4157413bf900 1
> ip6_dst_ifdown+0x5/0x250: dst_dev_put+0x2d/0xd0: dst ff1c4151bcac1080 eth0 ff1c4157413bf900 1
> tcp_v6_connect+0x37e/0x790: __inet_stream_connect+0x2c8/0x3a0: dst ff1c4151bcac1080 eth0 ff1c4157413bf900 2
> ip6_dst_lookup_flow+0x4f/0x1d0: tcp_v6_connect+0x320/0x790: dst ff1c4151bcac1080 eth0 ff1c4157413bf900 2
> __dev_queue_xmit+0x1af/0xd20: ip6_finish_output2+0x1f1/0x6e0: dst ff1c4151bcac1080 eth0 ff1c4157413bf900 675
> inet6_csk_xmit+0xa1/0x150: __tcp_transmit_skb+0x5f8/0xd40: dst ff1c4151bcac1080 eth0 ff1c4157413bf900 675
> Function                            Parent                           Op      Net            Device Dst              Refcount Diff
> ip6_route_output_flags+0x76/0x230: ip6_dst_lookup_tail+0x215/0x250: dst_hold ff1c4151bcac1080 eth0 ff1c4157413bf900 1
> ip6_negative_advice+0x66/0x2d0: tcp_retransmit_timer+0x5ed/0xac0: dst_hold ff1c4151bcac1080 eth0 ff1c4157413bf900 1
> dst_alloc+0x5e/0x180: ip6_dst_alloc+0x27/0x60: dst_hold ff1c4151bcac1080 eth0 ff1c4157413bf900 1
> ip6_route_output_flags+0x76/0x230: ip6_dst_lookup_tail+0x10f/0x250: dst_hold ff1c4151bcac1080 eth0 ff1c4157413bf900 2
> dst_release+0x32/0x140: ip6_negative_advice+0x137/0x2d0: dst_put ff1c4151bcac1080 eth0 ff1c4157413bf900 1
> dst_release+0x32/0x140: inet_sock_destruct+0x146/0x1c0: dst_put ff1c4151bcac1080 eth0 ff1c4157413bf900 2
> dst_release+0x32/0x140: rt6_remove_exception.part.53+0x7f/0xe0: dst_put ffffffff90ee2a80 blackhole_dev ff1c4157413bf900 1
> Function                 Parent                      Op     Net              Dst              Socket           Number of Calls
> sk_setup_caps+0x5/0x200: tcp_v6_connect+0x428/0x790: sk_dst ff1c4151bcac1080 ff1c4157413bf900 ff1c4158c07aaf80 1
> sk_setup_caps+0x5/0x200: tcp_v6_connect+0x428/0x790: sk_dst ff1c4151bcac1080 ff1c4157413bf900 ff1c418e49bac280 1
> sk_setup_caps+0x5/0x200: inet6_csk_route_socket+0x23c/0x2d0: sk_dst ff1c4151bcac1080 ff1c4157413bf900 ff1c418ea04f7200 1

The refcount of the dst was changed only when a dst_hold or dst_put operation was logged. The hold in dst_alloc() is balanced by the dst_release() in rt6_remove_exception(). The three holds in ip6_route_output_flags() belonged to the references held by sockets which were balanced by the dst_release() in inet_sock_destruct(). But that leaves one hold operation outstanding and unbalanced. There is a put operation for expired cached dst objects in ip6_negative_advice() but that one is intentionally balanced by a hold in the same function.

Accoding to the sk_dst trace entries, three sockets held a reference to the leaked dst. All of these sockets were eventually distroyed so there was no socket leak causing the dst leak, which rules out one of the possbile hypotheses. The dst was stored in the socket object while a connect() sycall was being executed for two of the sockets. An IMCPv6 packet carrying a Packet Too Big message was received by the third socket, which lead to the execution of routines changing the MTU for the dst and storing a reference to the dst in the socket object. As for races, sockets are locked in inet_stream_connect, tcp_write_timer() and tcp_v6_err(), and the mutual exclusion prevents races when reference to dst objects are set and reset for sockets. The put operation executed from dst_release() in inet_sock_destruct() is not part of a critical section protected by mutual exclusion but it is guaranteed to run only after the refcount of the socket has reached zero. All the other code paths in the trace increment the refcount of the socket while they manipulate dsts (the connect() syscall uses the refcount incremented by the socket() syscall).

This is a trace for the socket for which ip6_negative_advice() was executed for an expired cached dst:
> socket ff1c4158c07aaf80
> alloc: 1 destroy: 1 obj diff: 0
> hold ops: 9 put ops: 9 refcnt diff: 0
> ip6_negative_advice+0xcd/0x2d0: tcp_retransmit_timer+0x5ed/0xac0: sk_dst ff1c4151bcac1080 0 ff1c4158c07aaf80 1
> tcp_retransmit_timer+0x50d/0xac0: tcp_write_timer_handler+0xba/0x1f0: sk_dst ff1c4151bcac1080 0 ff1c4158c07aaf80 2
> ip6_negative_advice+0x169/0x2d0: tcp_retransmit_timer+0x5ed/0xac0: sk_dst ff1c4151bcac1080 0 ff1c4158c07aaf80 3
> sk_setup_caps+0x5/0x200: inet6_sk_rebuild_header+0x1f9/0x2a0: sk_dst ff1c4151bcac1080 ff1c4153a762d500 ff1c4158c07aaf80 5
> sk_setup_caps+0x5/0x200: tcp_v6_connect+0x428/0x790: sk_dst ff1c4151bcac1080 ff1c4157413bf900 ff1c4158c07aaf80 1
> sk_free+0x5/0x100: tcp_close+0x100/0x120: sk ff1c4151bcac1080 0 ff1c4158c07aaf80 1
> __sk_free+0x5/0x190: sk_free+0xe3/0x100: sk ff1c4151bcac1080 0 ff1c4158c07aaf80 1
> sk_destruct+0x5/0x150: __sk_free+0x83/0x190: sk ff1c4151bcac1080 0 ff1c4158c07aaf80 1
> sk_alloc+0x195/0x2a0: inet6_create+0xd5/0x450: sk ff1c4151bcac1080 0 ff1c4158c07aaf80 1
> sk_free+0x5/0x100: ip6_rcv_core.isra.25+0x27c/0x440: sk ff1c4151bcac1080 0 ff1c4158c07aaf80 7
> sk_reset_timer+0x7b/0x130: tcp_connect+0x83f/0xe00: sk_hold ff1c4151bcac1080 0 ff1c4158c07aaf80 1
> __tcp_close+0x15d/0x4c0: tcp_close+0x35/0x120: sk_hold ff1c4151bcac1080 0 ff1c4158c07aaf80 1
> sock_init_data_uid+0x107/0x2d0: inet6_create+0xec/0x450: sk_hold ff1c4151bcac1080 0 ff1c4158c07aaf80 1
> sk_reset_timer+0x7b/0x130: tcp_retransmit_timer+0x4e3/0xac0: sk_hold ff1c4151bcac1080 0 ff1c4158c07aaf80 6
> tcp_close+0x5/0x120: inet_release+0x3c/0x80: sk_put ff1c4151bcac1080 0 ff1c4158c07aaf80 1
> inet_csk_destroy_sock+0x90/0x1e0: __tcp_close+0x2ca/0x4c0: sk_put ff1c4151bcac1080 0 ff1c4158c07aaf80 1
> tcp_write_timer+0x61/0x260: call_timer_fn+0x27/0x130: sk_put ff1c4151bcac1080 0 ff1c4158c07aaf80 7

Besides dst ff1c4157413bf900, this sockets held a reference to another dst - ff1c4153a762d500. Regardless of the order in which the dst references were held, the extra dst_hold() in ip6_negative_advice() would preserve the refcount increment owned by the socket while also resetting the sk_dst_cache member of the socket to NULL, rendering it impossible for the socket to decrement the refcount in the future. We see that the socket was destroyed but the sk_dst_cache member was most probably NULL at the time because all the dst references were reset in tcp_retransmit_timer() and ip6_negative_advice().

Based on this data, I went back to recheck the kprobe module that was meant to implement a dst_release() in ip6_negative_advice(). I found a mistake in the module that resulted in executing:
> static void ip6_negative_advice(struct sock *sk, struct dst_entry *dst)
> {
>         struct rt6_info *rt = (struct rt6_info *) dst;
>         if (rt->rt6i_flags & RTF_CACHE) {
>                 rcu_read_lock();
>                 if (rt6_check_expired(rt)) {
>                         /* counteract the dst_release() in sk_dst_reset() */
>                         dst_hold(dst);
>                         sk_dst_reset(sk);
>                         rt6_remove_exception_rt(rt);
>                 }
> --->		  kprobe_calls_dst_release;
>                 rcu_read_unlock();
>                 return;
>         }
>         sk_dst_reset(sk);
> }
instead of this:
> static void ip6_negative_advice(struct sock *sk, struct dst_entry *dst)
> {
>         struct rt6_info *rt = (struct rt6_info *) dst;
>         if (rt->rt6i_flags & RTF_CACHE) {
>                 rcu_read_lock();
>                 if (rt6_check_expired(rt)) {
>                         /* counteract the dst_release() in sk_dst_reset() */
>                         dst_hold(dst);
> --->			  kprobe_calls_dst_release;
>                         sk_dst_reset(sk);
>                         rt6_remove_exception_rt(rt);
>                 }
>                 rcu_read_unlock();
>                 return;
>         }
>         sk_dst_reset(sk);
> }
which clearly shows why the test with the module resulted in negative overflow of dst refcounts. I fixed the module and further testing showed the issue is no longer reproducible.

I have managed to put together a minimal set of steps needed to reproduce the issue:
ip link add veth1 mtu 65535 type veth peer veth0 mtu 65535
ip netns add ns0
ip link set veth1 netns ns0
ip addr add fd00::1/24 dev veth0
ip -n ns0 addr add fd00::2/24 dev veth1
ip link set up dev veth0
ip -n ns0 link set up dev lo
ip -n ns0 link set up dev veth1
ip -n ns0 route add default via fd00::1 dev veth1

ip link add veth3 mtu 65535 type veth peer veth2 mtu 65535
ip netns add ns2
ip link set veth3 netns ns2
ip addr add fd02::1/24 dev veth2
ip -n ns2 addr add fd02::2/24 dev veth3
ip link set up dev veth2
ip -n ns2 link set up dev lo
ip -n ns2 link set up dev veth3
ip -n ns2 route add default via fd02::1 dev veth3

ip netns exec ns0 bash -c "echo 6 > /proc/sys/net/ipv6/route/mtu_expires"
#ip netns exec ns2 bash -c "echo 6 > /proc/sys/net/ipv6/route/mtu_expires"
ip netns exec ns0 bash -c "echo 900 > /proc/sys/net/ipv6/route/gc_interval"
#ip netns exec ns2 bash -c "echo 900 > /proc/sys/net/ipv6/route/gc_interval"
sleep 30

ip6tables -F
ip6tables -A FORWARD -i veth0 -d fd02::/24 -j ACCEPT
ip6tables -A FORWARD -i veth2 -d fd00::/24 -j ACCEPT
echo 1 > /proc/sys/net/ipv6/conf/all/forwarding
ip6tables -L -v

(ip netns exec ns2 netcat -6 -l -s fd02::2 -p 1234 &> /dev/null) & serv=$!
sleep 1
dd if=/dev/zero bs=1M count=100 | ip netns exec ns0 netcat -6 fd02::2 1234 & clnt=$!
sleep 1
kill $clnt $serv
wait $serv

(ip netns exec ns2 netcat -6 -l -s fd02::2 -p 1234 &> /dev/null) & serv=$!
sleep 1
dd if=/dev/zero bs=1M | ip netns exec ns0 netcat -6 fd02::2 1234 & clnt=$!
sleep 1
ip link set veth2 mtu 2000
sleep 1
ip6tables -D FORWARD -i veth2 -d fd00::/24 -j ACCEPT
ip6tables -A FORWARD -i veth2 -d fd00::/24 -j DROP

sleep 10
kill $clnt $serv
wait $serv

ip -n ns0 link set down dev lo
ip -n ns0 link set down dev veth1
ip -n ns0 link delete dev veth1
ip netns delete ns0

ip -n ns2 link set down dev lo
ip -n ns2 link set down dev veth3
ip -n ns2 link delete dev veth3
ip netns delete ns2

I was able to reproduce the dst leak under 6.12-rc7. I will submit a fix shortly.
Xin Long Nov. 12, 2024, 9:51 p.m. UTC | #8
On Tue, Nov 12, 2024 at 3:57 AM Jiri Wiesner <jwiesner@suse.de> wrote:
>
> On Sun, Oct 06, 2024 at 02:25:25PM -0400, Xin Long wrote:
> > We recently also encountered this
> >
> >   'unregister_netdevice: waiting for lo to become free. Usage count = X'
> >
> > problem on our customer env after backporting
> >
> >   Commit 92f1655aa2b22 ("net: fix __dst_negative_advice() race"). [1]
> >
> > The commit looks correct to me, so I guess it may uncover some existing
> > issues.
> >
> > As it took a very long time to get reproduced on our customer env, which
> > made it impossible to debug. Also the issue existed even after
> > disabling IPv6.
> >
> > It seems much easier to reproduce it on your customer env. So I'm wondering
> >
> > - Was the testing on your customer env related to IPv6 ?
> > - Does the issue still exist after reverting the commit [1] ?
>
> The customer tried reproducing the issue with 92f1655aa2b22 ("net: fix __dst_negative_advice() race") reverted and the issue appeared again. My next step was capturing more points within the stacktraces when dst refcounts are changed (I do not have full stack traces - the trace entries I store in the hash contain only the instruction pointer and the parent function). This is the trace for the leaked dst object:
> > dst ff1c4157413bf900
> > alloc: 1 destroy:  obj diff: 1
> > hold ops: 5 put ops: 4 refcnt diff: 1
> > Function                            Parent                           Op  Net            Device Dst              Number of Calls
> > inet6_csk_route_socket+0x1c2/0x2d0: inet6_csk_update_pmtu+0x58/0x90: dst ff1c4151bcac1080 eth0 ff1c4157413bf900 1
> > __ip6_rt_update_pmtu+0x183/0x3c0: inet6_csk_update_pmtu+0x4b/0x90: dst ff1c4151bcac1080 eth0 ff1c4157413bf900 1
> > ip6_dst_lookup_flow+0x4f/0x1d0: inet6_csk_route_socket+0x198/0x2d0: dst ff1c4151bcac1080 eth0 ff1c4157413bf900 1
> > ip6_dst_ifdown+0x5/0x250: dst_dev_put+0x2d/0xd0: dst ff1c4151bcac1080 eth0 ff1c4157413bf900 1
> > tcp_v6_connect+0x37e/0x790: __inet_stream_connect+0x2c8/0x3a0: dst ff1c4151bcac1080 eth0 ff1c4157413bf900 2
> > ip6_dst_lookup_flow+0x4f/0x1d0: tcp_v6_connect+0x320/0x790: dst ff1c4151bcac1080 eth0 ff1c4157413bf900 2
> > __dev_queue_xmit+0x1af/0xd20: ip6_finish_output2+0x1f1/0x6e0: dst ff1c4151bcac1080 eth0 ff1c4157413bf900 675
> > inet6_csk_xmit+0xa1/0x150: __tcp_transmit_skb+0x5f8/0xd40: dst ff1c4151bcac1080 eth0 ff1c4157413bf900 675
> > Function                            Parent                           Op      Net            Device Dst              Refcount Diff
> > ip6_route_output_flags+0x76/0x230: ip6_dst_lookup_tail+0x215/0x250: dst_hold ff1c4151bcac1080 eth0 ff1c4157413bf900 1
> > ip6_negative_advice+0x66/0x2d0: tcp_retransmit_timer+0x5ed/0xac0: dst_hold ff1c4151bcac1080 eth0 ff1c4157413bf900 1
> > dst_alloc+0x5e/0x180: ip6_dst_alloc+0x27/0x60: dst_hold ff1c4151bcac1080 eth0 ff1c4157413bf900 1
> > ip6_route_output_flags+0x76/0x230: ip6_dst_lookup_tail+0x10f/0x250: dst_hold ff1c4151bcac1080 eth0 ff1c4157413bf900 2
> > dst_release+0x32/0x140: ip6_negative_advice+0x137/0x2d0: dst_put ff1c4151bcac1080 eth0 ff1c4157413bf900 1
> > dst_release+0x32/0x140: inet_sock_destruct+0x146/0x1c0: dst_put ff1c4151bcac1080 eth0 ff1c4157413bf900 2
> > dst_release+0x32/0x140: rt6_remove_exception.part.53+0x7f/0xe0: dst_put ffffffff90ee2a80 blackhole_dev ff1c4157413bf900 1
> > Function                 Parent                      Op     Net              Dst              Socket           Number of Calls
> > sk_setup_caps+0x5/0x200: tcp_v6_connect+0x428/0x790: sk_dst ff1c4151bcac1080 ff1c4157413bf900 ff1c4158c07aaf80 1
> > sk_setup_caps+0x5/0x200: tcp_v6_connect+0x428/0x790: sk_dst ff1c4151bcac1080 ff1c4157413bf900 ff1c418e49bac280 1
> > sk_setup_caps+0x5/0x200: inet6_csk_route_socket+0x23c/0x2d0: sk_dst ff1c4151bcac1080 ff1c4157413bf900 ff1c418ea04f7200 1
>
> The refcount of the dst was changed only when a dst_hold or dst_put operation was logged. The hold in dst_alloc() is balanced by the dst_release() in rt6_remove_exception(). The three holds in ip6_route_output_flags() belonged to the references held by sockets which were balanced by the dst_release() in inet_sock_destruct(). But that leaves one hold operation outstanding and unbalanced. There is a put operation for expired cached dst objects in ip6_negative_advice() but that one is intentionally balanced by a hold in the same function.
>
This makes sense to me.

The code prior to 92f1655aa2b22 ("net: fix __dst_negative_advice()
race") had no longer been able to release the cached dst for the
reference held by socket in ip6_negative_advice() since
rt6_remove_exception_rt() is called there.

Hi, David Ahern,

Can you confirm this?

Thanks.

> Accoding to the sk_dst trace entries, three sockets held a reference to the leaked dst. All of these sockets were eventually distroyed so there was no socket leak causing the dst leak, which rules out one of the possbile hypotheses. The dst was stored in the socket object while a connect() sycall was being executed for two of the sockets. An IMCPv6 packet carrying a Packet Too Big message was received by the third socket, which lead to the execution of routines changing the MTU for the dst and storing a reference to the dst in the socket object. As for races, sockets are locked in inet_stream_connect, tcp_write_timer() and tcp_v6_err(), and the mutual exclusion prevents races when reference to dst objects are set and reset for sockets. The put operation executed from dst_release() in inet_sock_destruct() is not part of a critical section protected by mutual exclusion but it is guaranteed to run only after the refcount of the socket has reached zero. All the other code paths in the trace increment the refcount of the socket while they manipulate dsts (the connect() syscall uses the refcount incremented by the socket() syscall).
>
> This is a trace for the socket for which ip6_negative_advice() was executed for an expired cached dst:
> > socket ff1c4158c07aaf80
> > alloc: 1 destroy: 1 obj diff: 0
> > hold ops: 9 put ops: 9 refcnt diff: 0
> > ip6_negative_advice+0xcd/0x2d0: tcp_retransmit_timer+0x5ed/0xac0: sk_dst ff1c4151bcac1080 0 ff1c4158c07aaf80 1
> > tcp_retransmit_timer+0x50d/0xac0: tcp_write_timer_handler+0xba/0x1f0: sk_dst ff1c4151bcac1080 0 ff1c4158c07aaf80 2
> > ip6_negative_advice+0x169/0x2d0: tcp_retransmit_timer+0x5ed/0xac0: sk_dst ff1c4151bcac1080 0 ff1c4158c07aaf80 3
> > sk_setup_caps+0x5/0x200: inet6_sk_rebuild_header+0x1f9/0x2a0: sk_dst ff1c4151bcac1080 ff1c4153a762d500 ff1c4158c07aaf80 5
> > sk_setup_caps+0x5/0x200: tcp_v6_connect+0x428/0x790: sk_dst ff1c4151bcac1080 ff1c4157413bf900 ff1c4158c07aaf80 1
> > sk_free+0x5/0x100: tcp_close+0x100/0x120: sk ff1c4151bcac1080 0 ff1c4158c07aaf80 1
> > __sk_free+0x5/0x190: sk_free+0xe3/0x100: sk ff1c4151bcac1080 0 ff1c4158c07aaf80 1
> > sk_destruct+0x5/0x150: __sk_free+0x83/0x190: sk ff1c4151bcac1080 0 ff1c4158c07aaf80 1
> > sk_alloc+0x195/0x2a0: inet6_create+0xd5/0x450: sk ff1c4151bcac1080 0 ff1c4158c07aaf80 1
> > sk_free+0x5/0x100: ip6_rcv_core.isra.25+0x27c/0x440: sk ff1c4151bcac1080 0 ff1c4158c07aaf80 7
> > sk_reset_timer+0x7b/0x130: tcp_connect+0x83f/0xe00: sk_hold ff1c4151bcac1080 0 ff1c4158c07aaf80 1
> > __tcp_close+0x15d/0x4c0: tcp_close+0x35/0x120: sk_hold ff1c4151bcac1080 0 ff1c4158c07aaf80 1
> > sock_init_data_uid+0x107/0x2d0: inet6_create+0xec/0x450: sk_hold ff1c4151bcac1080 0 ff1c4158c07aaf80 1
> > sk_reset_timer+0x7b/0x130: tcp_retransmit_timer+0x4e3/0xac0: sk_hold ff1c4151bcac1080 0 ff1c4158c07aaf80 6
> > tcp_close+0x5/0x120: inet_release+0x3c/0x80: sk_put ff1c4151bcac1080 0 ff1c4158c07aaf80 1
> > inet_csk_destroy_sock+0x90/0x1e0: __tcp_close+0x2ca/0x4c0: sk_put ff1c4151bcac1080 0 ff1c4158c07aaf80 1
> > tcp_write_timer+0x61/0x260: call_timer_fn+0x27/0x130: sk_put ff1c4151bcac1080 0 ff1c4158c07aaf80 7
>
> Besides dst ff1c4157413bf900, this sockets held a reference to another dst - ff1c4153a762d500. Regardless of the order in which the dst references were held, the extra dst_hold() in ip6_negative_advice() would preserve the refcount increment owned by the socket while also resetting the sk_dst_cache member of the socket to NULL, rendering it impossible for the socket to decrement the refcount in the future. We see that the socket was destroyed but the sk_dst_cache member was most probably NULL at the time because all the dst references were reset in tcp_retransmit_timer() and ip6_negative_advice().
>
> Based on this data, I went back to recheck the kprobe module that was meant to implement a dst_release() in ip6_negative_advice(). I found a mistake in the module that resulted in executing:
> > static void ip6_negative_advice(struct sock *sk, struct dst_entry *dst)
> > {
> >         struct rt6_info *rt = (struct rt6_info *) dst;
> >         if (rt->rt6i_flags & RTF_CACHE) {
> >                 rcu_read_lock();
> >                 if (rt6_check_expired(rt)) {
> >                         /* counteract the dst_release() in sk_dst_reset() */
> >                         dst_hold(dst);
> >                         sk_dst_reset(sk);
> >                         rt6_remove_exception_rt(rt);
> >                 }
> > --->            kprobe_calls_dst_release;
> >                 rcu_read_unlock();
> >                 return;
> >         }
> >         sk_dst_reset(sk);
> > }
> instead of this:
> > static void ip6_negative_advice(struct sock *sk, struct dst_entry *dst)
> > {
> >         struct rt6_info *rt = (struct rt6_info *) dst;
> >         if (rt->rt6i_flags & RTF_CACHE) {
> >                 rcu_read_lock();
> >                 if (rt6_check_expired(rt)) {
> >                         /* counteract the dst_release() in sk_dst_reset() */
> >                         dst_hold(dst);
> > --->                    kprobe_calls_dst_release;
> >                         sk_dst_reset(sk);
> >                         rt6_remove_exception_rt(rt);
> >                 }
> >                 rcu_read_unlock();
> >                 return;
> >         }
> >         sk_dst_reset(sk);
> > }
> which clearly shows why the test with the module resulted in negative overflow of dst refcounts. I fixed the module and further testing showed the issue is no longer reproducible.
>
> I have managed to put together a minimal set of steps needed to reproduce the issue:
> ip link add veth1 mtu 65535 type veth peer veth0 mtu 65535
> ip netns add ns0
> ip link set veth1 netns ns0
> ip addr add fd00::1/24 dev veth0
> ip -n ns0 addr add fd00::2/24 dev veth1
> ip link set up dev veth0
> ip -n ns0 link set up dev lo
> ip -n ns0 link set up dev veth1
> ip -n ns0 route add default via fd00::1 dev veth1
>
> ip link add veth3 mtu 65535 type veth peer veth2 mtu 65535
> ip netns add ns2
> ip link set veth3 netns ns2
> ip addr add fd02::1/24 dev veth2
> ip -n ns2 addr add fd02::2/24 dev veth3
> ip link set up dev veth2
> ip -n ns2 link set up dev lo
> ip -n ns2 link set up dev veth3
> ip -n ns2 route add default via fd02::1 dev veth3
>
> ip netns exec ns0 bash -c "echo 6 > /proc/sys/net/ipv6/route/mtu_expires"
> #ip netns exec ns2 bash -c "echo 6 > /proc/sys/net/ipv6/route/mtu_expires"
> ip netns exec ns0 bash -c "echo 900 > /proc/sys/net/ipv6/route/gc_interval"
> #ip netns exec ns2 bash -c "echo 900 > /proc/sys/net/ipv6/route/gc_interval"
> sleep 30
>
> ip6tables -F
> ip6tables -A FORWARD -i veth0 -d fd02::/24 -j ACCEPT
> ip6tables -A FORWARD -i veth2 -d fd00::/24 -j ACCEPT
> echo 1 > /proc/sys/net/ipv6/conf/all/forwarding
> ip6tables -L -v
>
> (ip netns exec ns2 netcat -6 -l -s fd02::2 -p 1234 &> /dev/null) & serv=$!
> sleep 1
> dd if=/dev/zero bs=1M count=100 | ip netns exec ns0 netcat -6 fd02::2 1234 & clnt=$!
> sleep 1
> kill $clnt $serv
> wait $serv
>
> (ip netns exec ns2 netcat -6 -l -s fd02::2 -p 1234 &> /dev/null) & serv=$!
> sleep 1
> dd if=/dev/zero bs=1M | ip netns exec ns0 netcat -6 fd02::2 1234 & clnt=$!
> sleep 1
> ip link set veth2 mtu 2000
> sleep 1
> ip6tables -D FORWARD -i veth2 -d fd00::/24 -j ACCEPT
> ip6tables -A FORWARD -i veth2 -d fd00::/24 -j DROP
>
> sleep 10
> kill $clnt $serv
> wait $serv
>
> ip -n ns0 link set down dev lo
> ip -n ns0 link set down dev veth1
> ip -n ns0 link delete dev veth1
> ip netns delete ns0
>
> ip -n ns2 link set down dev lo
> ip -n ns2 link set down dev veth3
> ip -n ns2 link delete dev veth3
> ip netns delete ns2
>
> I was able to reproduce the dst leak under 6.12-rc7. I will submit a fix shortly.
> --
> Jiri Wiesner
> SUSE Labs
Jiri Wiesner Nov. 13, 2024, 10:59 a.m. UTC | #9
On Tue, Nov 12, 2024 at 04:51:27PM -0500, Xin Long wrote:
> On Tue, Nov 12, 2024 at 3:57 AM Jiri Wiesner <jwiesner@suse.de> wrote:
> >
> > On Sun, Oct 06, 2024 at 02:25:25PM -0400, Xin Long wrote:
> > > We recently also encountered this
> > >
> > >   'unregister_netdevice: waiting for lo to become free. Usage count = X'
> > >
> > > problem on our customer env after backporting
> > >
> > >   Commit 92f1655aa2b22 ("net: fix __dst_negative_advice() race"). [1]
> > >
> > > The commit looks correct to me, so I guess it may uncover some existing
> > > issues.
> > >
> > > As it took a very long time to get reproduced on our customer env, which
> > > made it impossible to debug. Also the issue existed even after
> > > disabling IPv6.
> > >
> > > It seems much easier to reproduce it on your customer env. So I'm wondering
> > >
> > > - Was the testing on your customer env related to IPv6 ?
> > > - Does the issue still exist after reverting the commit [1] ?
> >
> > The customer tried reproducing the issue with 92f1655aa2b22 ("net: fix __dst_negative_advice() race") reverted and the issue appeared again. My next step was capturing more points within the stacktraces when dst refcounts are changed (I do not have full stack traces - the trace entries I store in the hash contain only the instruction pointer and the parent function). This is the trace for the leaked dst object:
> > > dst ff1c4157413bf900
> > > alloc: 1 destroy:  obj diff: 1
> > > hold ops: 5 put ops: 4 refcnt diff: 1
> > > Function                            Parent                           Op  Net            Device Dst              Number of Calls
> > > inet6_csk_route_socket+0x1c2/0x2d0: inet6_csk_update_pmtu+0x58/0x90: dst ff1c4151bcac1080 eth0 ff1c4157413bf900 1
> > > __ip6_rt_update_pmtu+0x183/0x3c0: inet6_csk_update_pmtu+0x4b/0x90: dst ff1c4151bcac1080 eth0 ff1c4157413bf900 1
> > > ip6_dst_lookup_flow+0x4f/0x1d0: inet6_csk_route_socket+0x198/0x2d0: dst ff1c4151bcac1080 eth0 ff1c4157413bf900 1
> > > ip6_dst_ifdown+0x5/0x250: dst_dev_put+0x2d/0xd0: dst ff1c4151bcac1080 eth0 ff1c4157413bf900 1
> > > tcp_v6_connect+0x37e/0x790: __inet_stream_connect+0x2c8/0x3a0: dst ff1c4151bcac1080 eth0 ff1c4157413bf900 2
> > > ip6_dst_lookup_flow+0x4f/0x1d0: tcp_v6_connect+0x320/0x790: dst ff1c4151bcac1080 eth0 ff1c4157413bf900 2
> > > __dev_queue_xmit+0x1af/0xd20: ip6_finish_output2+0x1f1/0x6e0: dst ff1c4151bcac1080 eth0 ff1c4157413bf900 675
> > > inet6_csk_xmit+0xa1/0x150: __tcp_transmit_skb+0x5f8/0xd40: dst ff1c4151bcac1080 eth0 ff1c4157413bf900 675
> > > Function                            Parent                           Op      Net            Device Dst              Refcount Diff
> > > ip6_route_output_flags+0x76/0x230: ip6_dst_lookup_tail+0x215/0x250: dst_hold ff1c4151bcac1080 eth0 ff1c4157413bf900 1
> > > ip6_negative_advice+0x66/0x2d0: tcp_retransmit_timer+0x5ed/0xac0: dst_hold ff1c4151bcac1080 eth0 ff1c4157413bf900 1
> > > dst_alloc+0x5e/0x180: ip6_dst_alloc+0x27/0x60: dst_hold ff1c4151bcac1080 eth0 ff1c4157413bf900 1
> > > ip6_route_output_flags+0x76/0x230: ip6_dst_lookup_tail+0x10f/0x250: dst_hold ff1c4151bcac1080 eth0 ff1c4157413bf900 2
> > > dst_release+0x32/0x140: ip6_negative_advice+0x137/0x2d0: dst_put ff1c4151bcac1080 eth0 ff1c4157413bf900 1
> > > dst_release+0x32/0x140: inet_sock_destruct+0x146/0x1c0: dst_put ff1c4151bcac1080 eth0 ff1c4157413bf900 2
> > > dst_release+0x32/0x140: rt6_remove_exception.part.53+0x7f/0xe0: dst_put ffffffff90ee2a80 blackhole_dev ff1c4157413bf900 1
> > > Function                 Parent                      Op     Net              Dst              Socket           Number of Calls
> > > sk_setup_caps+0x5/0x200: tcp_v6_connect+0x428/0x790: sk_dst ff1c4151bcac1080 ff1c4157413bf900 ff1c4158c07aaf80 1
> > > sk_setup_caps+0x5/0x200: tcp_v6_connect+0x428/0x790: sk_dst ff1c4151bcac1080 ff1c4157413bf900 ff1c418e49bac280 1
> > > sk_setup_caps+0x5/0x200: inet6_csk_route_socket+0x23c/0x2d0: sk_dst ff1c4151bcac1080 ff1c4157413bf900 ff1c418ea04f7200 1
> >
> > The refcount of the dst was changed only when a dst_hold or dst_put operation was logged. The hold in dst_alloc() is balanced by the dst_release() in rt6_remove_exception(). The three holds in ip6_route_output_flags() belonged to the references held by sockets which were balanced by the dst_release() in inet_sock_destruct(). But that leaves one hold operation outstanding and unbalanced. There is a put operation for expired cached dst objects in ip6_negative_advice() but that one is intentionally balanced by a hold in the same function.
> >
> This makes sense to me.
> 
> The code prior to 92f1655aa2b22 ("net: fix __dst_negative_advice()
> race") had no longer been able to release the cached dst for the
> reference held by socket in ip6_negative_advice() since
> rt6_remove_exception_rt() is called there.
> 
> Hi, David Ahern,
> 
> Can you confirm this?

I have submitted "[PATCH net] net/ipv6: release expired exception dst cached in socket" which should have clearer explanations not burdened by the debugging process. I initially debugged the issue on a customer's server without having any idea how to reproduce it. I then reconstructed the steps to reproduce the issue from the tracing data, which allowed me to check for the issue in the latest upstream kernel.
David Ahern Nov. 13, 2024, 5:27 p.m. UTC | #10
On 11/12/24 2:51 PM, Xin Long wrote:
> This makes sense to me.
> 
> The code prior to 92f1655aa2b22 ("net: fix __dst_negative_advice()
> race") had no longer been able to release the cached dst for the
> reference held by socket in ip6_negative_advice() since
> rt6_remove_exception_rt() is called there.
> 
> Hi, David Ahern,
> 
> Can you confirm this?
> 
> Thanks.


I am traveling this week, so even less time than usual for patches. I
will need to find to refresh my memory on the history of the code.
diff mbox series

Patch

diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index b4251915585f..b70267c8d251 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -2780,10 +2780,7 @@  static void ip6_negative_advice(struct sock *sk,
 	if (rt->rt6i_flags & RTF_CACHE) {
 		rcu_read_lock();
 		if (rt6_check_expired(rt)) {
-			/* counteract the dst_release() in sk_dst_reset() */
-			dst_hold(dst);
 			sk_dst_reset(sk);
-
 			rt6_remove_exception_rt(rt);
 		}
 		rcu_read_unlock();