Message ID | 20220721120316.17070-1-ap420073@gmail.com (mailing list archive) |
---|---
State | Changes Requested |
Delegated to: | Netdev Maintainers |
Series | [net] net: mld: do not use system_wq in the mld
On Thu, Jul 21, 2022 at 2:03 PM Taehee Yoo <ap420073@gmail.com> wrote:
>
> mld works are supposed to be executed in mld_wq.
> But mld_{query | report}_work() calls schedule_delayed_work().
> schedule_delayed_work() internally uses system_wq.
> So, this would cause the reference count leak.

I do not think the changelog is accurate.
At least I do not understand it yet.

We can not unload the ipv6 module, so destroy_workqueue(mld_wq) is never used.

>
> splat looks like:
> unregister_netdevice: waiting for br1 to become free. Usage count = 2
> leaked reference.
>  ipv6_add_dev+0x3a5/0x1070
>  addrconf_notify+0x4f3/0x1760
>  notifier_call_chain+0x9e/0x180
>  register_netdevice+0xd10/0x11e0
>  br_dev_newlink+0x27/0x100 [bridge]
>  __rtnl_newlink+0xd85/0x14e0
>  rtnl_newlink+0x5f/0x90
>  rtnetlink_rcv_msg+0x335/0x9a0
>  netlink_rcv_skb+0x121/0x350
>  netlink_unicast+0x439/0x710
>  netlink_sendmsg+0x75f/0xc00
>  ____sys_sendmsg+0x694/0x860
>  ___sys_sendmsg+0xe9/0x160
>  __sys_sendmsg+0xbe/0x150
>  do_syscall_64+0x3b/0x90
>  entry_SYSCALL_64_after_hwframe+0x63/0xcd
>
> Fixes: f185de28d9ae ("mld: add new workqueues for process mld events")
> Signed-off-by: Taehee Yoo <ap420073@gmail.com>
> ---
>  net/ipv6/mcast.c | 14 ++++++++------
>  1 file changed, 8 insertions(+), 6 deletions(-)
>
> diff --git a/net/ipv6/mcast.c b/net/ipv6/mcast.c
> index 7f695c39d9a8..87c699d57b36 100644
> --- a/net/ipv6/mcast.c
> +++ b/net/ipv6/mcast.c
> @@ -1522,7 +1522,6 @@ static void mld_query_work(struct work_struct *work)
>
>                 if (++cnt >= MLD_MAX_QUEUE) {
>                         rework = true;
> -                       schedule_delayed_work(&idev->mc_query_work, 0);
>                         break;
>                 }
>         }
> @@ -1533,8 +1532,10 @@ static void mld_query_work(struct work_struct *work)
>                 __mld_query_work(skb);
>         mutex_unlock(&idev->mc_lock);
>
> -       if (!rework)
> -               in6_dev_put(idev);
> +       if (rework && queue_delayed_work(mld_wq, &idev->mc_query_work, 0))

It seems the 'real issue' was that
schedule_delayed_work(&idev->mc_query_work, 0) could be a NOP
because the work queue was already scheduled ?

> +               return;
> +
> +       in6_dev_put(idev);
>  }
>
>  /* called with rcu_read_lock() */
> @@ -1624,7 +1625,6 @@ static void mld_report_work(struct work_struct *work)
>
>                 if (++cnt >= MLD_MAX_QUEUE) {
>                         rework = true;
> -                       schedule_delayed_work(&idev->mc_report_work, 0);
>                         break;
>                 }
>         }
> @@ -1635,8 +1635,10 @@ static void mld_report_work(struct work_struct *work)
>                 __mld_report_work(skb);
>         mutex_unlock(&idev->mc_lock);
>
> -       if (!rework)
> -               in6_dev_put(idev);
> +       if (rework && queue_delayed_work(mld_wq, &idev->mc_report_work, 0))
> +               return;
> +
> +       in6_dev_put(idev);
>  }
>
>  static bool is_in(struct ifmcaddr6 *pmc, struct ip6_sf_list *psf, int type,
> --
> 2.17.1
>
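The workqueue return-value contract behind Eric's question: both
schedule_delayed_work() and queue_delayed_work() return false when the work
item is already pending, in which case the call queues nothing. Below is a
minimal, hypothetical module sketching just that contract; the demo_* names
are illustrative only and not part of the patch or the thread.

#include <linux/module.h>
#include <linux/workqueue.h>

static void demo_fn(struct work_struct *work)
{
	pr_info("demo work ran\n");
}

static DECLARE_DELAYED_WORK(demo_work, demo_fn);

static int __init demo_init(void)
{
	bool first, second;

	/* First call queues the work and returns true. */
	first = schedule_delayed_work(&demo_work, HZ);
	/* Second call finds the work already pending and returns false:
	 * nothing new is queued -- the NOP Eric is describing. */
	second = schedule_delayed_work(&demo_work, HZ);

	pr_info("demo: first=%d second=%d\n", first, second);
	return 0;
}

static void __exit demo_exit(void)
{
	cancel_delayed_work_sync(&demo_work);
}

module_init(demo_init);
module_exit(demo_exit);
MODULE_LICENSE("GPL");

If a reference were taken on every queue attempt but dropped only once per
work execution, the false-returning call would leave exactly one reference
behind, which matches the leak pattern in the splat above.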
Hi Eric,
Thank you so much for your review!

On 7/21/22 23:04, Eric Dumazet wrote:
> On Thu, Jul 21, 2022 at 2:03 PM Taehee Yoo <ap420073@gmail.com> wrote:
>>
>> mld works are supposed to be executed in mld_wq.
>> But mld_{query | report}_work() calls schedule_delayed_work().
>> schedule_delayed_work() internally uses system_wq.
>> So, this would cause the reference count leak.
>
> I do not think the changelog is accurate.
> At least I do not understand it yet.
>
> We can not unload the ipv6 module, so destroy_workqueue(mld_wq) is never used.
>
>>
>> splat looks like:
>> unregister_netdevice: waiting for br1 to become free. Usage count = 2
>> leaked reference.
>>  ipv6_add_dev+0x3a5/0x1070
>>  addrconf_notify+0x4f3/0x1760
>>  notifier_call_chain+0x9e/0x180
>>  register_netdevice+0xd10/0x11e0
>>  br_dev_newlink+0x27/0x100 [bridge]
>>  __rtnl_newlink+0xd85/0x14e0
>>  rtnl_newlink+0x5f/0x90
>>  rtnetlink_rcv_msg+0x335/0x9a0
>>  netlink_rcv_skb+0x121/0x350
>>  netlink_unicast+0x439/0x710
>>  netlink_sendmsg+0x75f/0xc00
>>  ____sys_sendmsg+0x694/0x860
>>  ___sys_sendmsg+0xe9/0x160
>>  __sys_sendmsg+0xbe/0x150
>>  do_syscall_64+0x3b/0x90
>>  entry_SYSCALL_64_after_hwframe+0x63/0xcd
>>
>> Fixes: f185de28d9ae ("mld: add new workqueues for process mld events")
>> Signed-off-by: Taehee Yoo <ap420073@gmail.com>
>> ---
>>  net/ipv6/mcast.c | 14 ++++++++------
>>  1 file changed, 8 insertions(+), 6 deletions(-)
>>
>> diff --git a/net/ipv6/mcast.c b/net/ipv6/mcast.c
>> index 7f695c39d9a8..87c699d57b36 100644
>> --- a/net/ipv6/mcast.c
>> +++ b/net/ipv6/mcast.c
>> @@ -1522,7 +1522,6 @@ static void mld_query_work(struct work_struct *work)
>>
>>                 if (++cnt >= MLD_MAX_QUEUE) {
>>                         rework = true;
>> -                       schedule_delayed_work(&idev->mc_query_work, 0);
>>                         break;
>>                 }
>>         }
>> @@ -1533,8 +1532,10 @@ static void mld_query_work(struct work_struct *work)
>>                 __mld_query_work(skb);
>>         mutex_unlock(&idev->mc_lock);
>>
>> -       if (!rework)
>> -               in6_dev_put(idev);
>> +       if (rework && queue_delayed_work(mld_wq, &idev->mc_query_work, 0))
>
> It seems the 'real issue' was that
> schedule_delayed_work(&idev->mc_query_work, 0) could be a NOP
> because the work queue was already scheduled ?
>

I think your assumption is right.
I tested the below scenario, which reproduces the real issue.

THREAD0                    THREAD1
mld_report_work()
                           spin_lock_bh()
                           if (!mod_delayed_work()) <-- queued
                                   in6_dev_hold();
                           spin_unlock_bh()
spin_lock_bh()
schedule_delayed_work() <-- return false, already queued by THREAD1
spin_unlock_bh()
return;
//no in6_dev_put() regardless of the return value of schedule_delayed_work().

In order to check, I added printk like below.

	if (++cnt >= MLD_MAX_QUEUE) {
		rework = true;
		if (!schedule_delayed_work(&idev->mc_report_work, 0))
			printk("[TEST]%s %u \n", __func__, __LINE__);
		break;

If the TEST log message is printed, work is already queued by other logic.
So, it indicates a reference count is leaked.
The result is that I can see log messages only when the reference count
leak occurs.
So, although I tested it only for 1 hour, I'm sure that this bug comes
from the missing check of the return value of schedule_delayed_work().

As you said, this changelog is not correct.
system_wq and mld_wq are not related to this issue.

I would like to send a v2 patch after some more tests.
The v2 patch will change the commit message.

>
>
>> +               return;
>> +
>> +       in6_dev_put(idev);
>>  }
>>
>>  /* called with rcu_read_lock() */
>> @@ -1624,7 +1625,6 @@ static void mld_report_work(struct work_struct *work)
>>
>>                 if (++cnt >= MLD_MAX_QUEUE) {
>>                         rework = true;
>> -                       schedule_delayed_work(&idev->mc_report_work, 0);
>>                         break;
>>                 }
>>         }
>> @@ -1635,8 +1635,10 @@ static void mld_report_work(struct work_struct *work)
>>                 __mld_report_work(skb);
>>         mutex_unlock(&idev->mc_lock);
>>
>> -       if (!rework)
>> -               in6_dev_put(idev);
>> +       if (rework && queue_delayed_work(mld_wq, &idev->mc_report_work, 0))
>> +               return;
>> +
>> +       in6_dev_put(idev);
>>  }
>>
>>  static bool is_in(struct ifmcaddr6 *pmc, struct ip6_sf_list *psf, int type,
>> --
>> 2.17.1
>>

Thanks a lot!
Taehee Yoo
On Thu, Jul 21, 2022 at 7:53 PM Taehee Yoo <ap420073@gmail.com> wrote:
>
> Hi Eric,
> Thank you so much for your review!
>

...

> I think your assumption is right.
> I tested the below scenario, which reproduces the real issue.
>
> THREAD0                    THREAD1
> mld_report_work()
>                            spin_lock_bh()
>                            if (!mod_delayed_work()) <-- queued
>                                    in6_dev_hold();
>                            spin_unlock_bh()
> spin_lock_bh()
> schedule_delayed_work() <-- return false, already queued by THREAD1
> spin_unlock_bh()
> return;
> //no in6_dev_put() regardless of the return value of schedule_delayed_work().
>
> In order to check, I added printk like below.
>
>         if (++cnt >= MLD_MAX_QUEUE) {
>                 rework = true;
>                 if (!schedule_delayed_work(&idev->mc_report_work, 0))
>                         printk("[TEST]%s %u \n", __func__, __LINE__);
>                 break;
>
> If the TEST log message is printed, work is already queued by other logic.
> So, it indicates a reference count is leaked.
> The result is that I can see log messages only when the reference count
> leak occurs.
> So, although I tested it only for 1 hour, I'm sure that this bug comes
> from the missing check of the return value of schedule_delayed_work().
>
> As you said, this changelog is not correct.
> system_wq and mld_wq are not related to this issue.
>
> I would like to send a v2 patch after some more tests.
> The v2 patch will change the commit message.

Can you describe what kind of tests you are running ?
Was it a syzbot report ?
On 7/22/22 03:34, Eric Dumazet wrote:
> On Thu, Jul 21, 2022 at 7:53 PM Taehee Yoo <ap420073@gmail.com> wrote:
>>
>> Hi Eric,
>> Thank you so much for your review!
>>
>
> ...
>
>> I think your assumption is right.
>> I tested the below scenario, which reproduces the real issue.
>>
>> THREAD0                    THREAD1
>> mld_report_work()
>>                            spin_lock_bh()
>>                            if (!mod_delayed_work()) <-- queued
>>                                    in6_dev_hold();
>>                            spin_unlock_bh()
>> spin_lock_bh()
>> schedule_delayed_work() <-- return false, already queued by THREAD1
>> spin_unlock_bh()
>> return;
>> //no in6_dev_put() regardless of the return value of schedule_delayed_work().
>>
>> In order to check, I added printk like below.
>>
>>         if (++cnt >= MLD_MAX_QUEUE) {
>>                 rework = true;
>>                 if (!schedule_delayed_work(&idev->mc_report_work, 0))
>>                         printk("[TEST]%s %u \n", __func__, __LINE__);
>>                 break;
>>
>> If the TEST log message is printed, work is already queued by other logic.
>> So, it indicates a reference count is leaked.
>> The result is that I can see log messages only when the reference count
>> leak occurs.
>> So, although I tested it only for 1 hour, I'm sure that this bug comes
>> from the missing check of the return value of schedule_delayed_work().
>>
>> As you said, this changelog is not correct.
>> system_wq and mld_wq are not related to this issue.
>>
>> I would like to send a v2 patch after some more tests.
>> The v2 patch will change the commit message.
>
> Can you describe what kind of tests you are running ?
> Was it a syzbot report ?

I found this bug while testing another syzbot report.
(https://syzkaller.appspot.com/bug?id=ed41eaa4367b421d37aab5dee25e3f4c91ceae93)
And I can't find the same case in the syzbot reports list.

I just use some command lines and many kernel debug options such as
kmemleak, kasan, lockdep, and others.

a=ns-$RANDOM
b=ns-$RANDOM
ip netns add $a
ip netns add $b
echo 'file net/ipv6/addrconf.c +p' > /sys/kernel/debug/dynamic_debug/control
echo 'file net/ipv6/addrconf_core.c +p' > /sys/kernel/debug/dynamic_debug/control
echo 'file net/ipv6/mcast.c +p' > /sys/kernel/debug/dynamic_debug/control
ip netns exec $a sysctl net.ipv6.mld_max_msf=8024 -w
ip netns exec $b sysctl net.ipv6.mld_max_msf=8024 -w
ip netns exec $a ip link add br0 type bridge mcast_querier 1 mcast_query_interval 100 mcast_mld_version 1
ip netns exec $b ip link add br1 type bridge mcast_querier 1 mcast_query_interval 100 mcast_mld_version 1
ip netns exec $a ip link set br0 up
ip netns exec $b ip link set br1 up

for i in {0..5}
do
	ip netns exec $a ip link add vetha$i type veth peer name vethb$i
	ip netns exec $a ip link set vethb$i netns $b
	ip netns exec $a ip link set vetha$i master br0
	ip netns exec $a ip link set vetha$i up
	ip netns exec $b ip link set vethb$i master br1
	ip netns exec $b ip link set vethb$i up
done

sleep 10
ip netns del $a
ip netns del $b

This script is not a real use case.
It just generates *looping* packets and they contain many mld packets
due to the bridge query option.
I think a packet generator would be more useful to reproduce this bug.
On Fri, Jul 22, 2022 at 04:01:13AM +0900, Taehee Yoo wrote:
> On 7/22/22 03:34, Eric Dumazet wrote:
> > On Thu, Jul 21, 2022 at 7:53 PM Taehee Yoo <ap420073@gmail.com> wrote:
> >>
> >> Hi Eric,
> >> Thank you so much for your review!
> >>
> >
> > ...
> >
> >> I think your assumption is right.
> >> I tested the below scenario, which reproduces the real issue.
> >>
> >> THREAD0                    THREAD1
> >> mld_report_work()
> >>                            spin_lock_bh()
> >>                            if (!mod_delayed_work()) <-- queued
> >>                                    in6_dev_hold();
> >>                            spin_unlock_bh()
> >> spin_lock_bh()
> >> schedule_delayed_work() <-- return false, already queued by THREAD1
> >> spin_unlock_bh()
> >> return;
> >> //no in6_dev_put() regardless of the return value of schedule_delayed_work().
> >>
> >> In order to check, I added printk like below.
> >>
> >>         if (++cnt >= MLD_MAX_QUEUE) {
> >>                 rework = true;
> >>                 if (!schedule_delayed_work(&idev->mc_report_work, 0))
> >>                         printk("[TEST]%s %u \n", __func__, __LINE__);
> >>                 break;
> >>
> >> If the TEST log message is printed, work is already queued by other logic.
> >> So, it indicates a reference count is leaked.
> >> The result is that I can see log messages only when the reference count
> >> leak occurs.
> >> So, although I tested it only for 1 hour, I'm sure that this bug comes
> >> from the missing check of the return value of schedule_delayed_work().
> >>
> >> As you said, this changelog is not correct.
> >> system_wq and mld_wq are not related to this issue.
> >>
> >> I would like to send a v2 patch after some more tests.
> >> The v2 patch will change the commit message.
> >
> > Can you describe what kind of tests you are running ?
> > Was it a syzbot report ?
>
> I found this bug while testing another syzbot report.
> (https://syzkaller.appspot.com/bug?id=ed41eaa4367b421d37aab5dee25e3f4c91ceae93)
> And I can't find the same case in the syzbot reports list.
>
> I just use some command lines and many kernel debug options such as
> kmemleak, kasan, lockdep, and others.
>

Hi Taehee,

I got a similar issue to yours after Eric's 2d3916f31891
("ipv6: fix skb drops in igmp6_event_query() and igmp6_event_report()").
I use force_mld_version=1 and add a lot of IPv6 addresses to generate an
mld report flood. Here is my reproducer:

[root@bootp-73-131-221 ~]# cat mld.sh
#!/bin/sh
ip netns add ns1
ip netns add ns2
ip netns exec ns1 sysctl -w net.ipv6.conf.all.force_mld_version=1
ip netns exec ns2 sysctl -w net.ipv6.conf.all.force_mld_version=1
ip -n ns1 link add veth0 type veth peer name veth0 netns ns2
ip -n ns1 link set veth0 up
ip -n ns2 link set veth0 up

for i in `seq 50`; do
	for j in `seq 100`; do
		ip -n ns1 addr add 2021:${i}::${j}/64 dev veth0
		ip -n ns2 addr add 2022:${i}::${j}/64 dev veth0
	done
done

modprobe -r veth
ip -a netns del

After `modprobe -r veth` we will see the ref leak error:

[ 1382.683371] unregister_netdevice: waiting for veth0 to become free. Usage count = 2
[ 1392.931397] unregister_netdevice: waiting for veth0 to become free. Usage count = 2
[ 1402.939234] unregister_netdevice: waiting for veth0 to become free. Usage count = 2
[ 1413.179296] unregister_netdevice: waiting for veth0 to become free. Usage count = 2

I tried to debug the reason but didn't find much of a clue. Maybe this
reproducer could help you.

Thanks
Hangbin
On Fri, Jul 22, 2022 at 10:35:52AM +0800, Hangbin Liu wrote:
> > I found this bug while testing another syzbot report.
> > (https://syzkaller.appspot.com/bug?id=ed41eaa4367b421d37aab5dee25e3f4c91ceae93)
> > And I can't find the same case in the syzbot reports list.
> >
> > I just use some command lines and many kernel debug options such as
> > kmemleak, kasan, lockdep, and others.
> >
>
> Hi Taehee,
>
> I got a similar issue to yours after Eric's 2d3916f31891
> ("ipv6: fix skb drops in igmp6_event_query() and igmp6_event_report()").
> I use force_mld_version=1 and add a lot of IPv6 addresses to generate an
> mld report flood. Here is my reproducer:

BTW, thanks for your fix. With your patch the issue is fixed. Please feel free
to add

Tested-by: Hangbin Liu <liuhangbin@gmail.com>

Cheers
Hangbin
Hi Hangbin,
Thank you so much for the test and review!

On 7/22/22 13:23, Hangbin Liu wrote:
> On Fri, Jul 22, 2022 at 10:35:52AM +0800, Hangbin Liu wrote:
>>> I found this bug while testing another syzbot report.
>>> (https://syzkaller.appspot.com/bug?id=ed41eaa4367b421d37aab5dee25e3f4c91ceae93)
>>> And I can't find the same case in the syzbot reports list.
>>>
>>> I just use some command lines and many kernel debug options such as
>>> kmemleak, kasan, lockdep, and others.
>>>
>>
>> Hi Taehee,
>>
>> I got a similar issue to yours after Eric's 2d3916f31891
>> ("ipv6: fix skb drops in igmp6_event_query() and igmp6_event_report()").
>> I use force_mld_version=1 and add a lot of IPv6 addresses to generate an
>> mld report flood. Here is my reproducer:
>
> BTW, thanks for your fix. With your patch the issue is fixed. Please feel free
> to add
>
> Tested-by: Hangbin Liu <liuhangbin@gmail.com>
>

I also tested with your reproducer.
I checked that it reproduces the same issue.

[   69.862696][   T58] [TEST]mld_report_work 1629
[   87.129371][   T10] unregister_netdevice: waiting for veth0 to become free. Usage count = 2
[   87.132106][   T10] leaked reference.
[   87.133276][   T10]  ipv6_add_dev+0x324/0xec0
[   87.134724][   T10]  addrconf_notify+0x481/0xd10
[   87.136200][   T10]  raw_notifier_call_chain+0xe3/0x120
[   87.137829][   T10]  call_netdevice_notifiers+0x106/0x160
[   87.139454][   T10]  register_netdevice+0x114c/0x16b0
[   87.140380][   T10]  veth_newlink+0x48b/0xa50 [veth]
[   87.141268][   T10]  rtnl_newlink+0x11a2/0x1a40
[   87.142073][   T10]  rtnetlink_rcv_msg+0x63f/0xc00
[   87.142956][   T10]  netlink_rcv_skb+0x1df/0x3e0
[   87.143861][   T10]  netlink_unicast+0x5de/0x850
[   87.144725][   T10]  netlink_sendmsg+0x6c9/0xa90
[   87.145595][   T10]  ____sys_sendmsg+0x76a/0x780
[   87.146483][   T10]  __sys_sendmsg+0x27c/0x340
[   87.147340][   T10]  do_syscall_64+0x43/0x90
[   87.148158][   T10]  entry_SYSCALL_64_after_hwframe+0x63/0xcd

The printk message and call trace are the same.
So, I'm sure that this is the same issue.
Also, I tested with my patch and your script for 1 hour, and the
reference count leak disappeared.

If you are okay, I would like to attach your reproducer script to the
commit message.

> Cheers
> Hangbin

Thank you so much
Taehee Yoo
On Fri, Jul 22, 2022 at 01:50:14PM +0900, Taehee Yoo wrote:
> If you are okay, I would like to attach your reproducer script to the
> commit message.

Sure, please feel free to do that.

Hangbin
diff --git a/net/ipv6/mcast.c b/net/ipv6/mcast.c
index 7f695c39d9a8..87c699d57b36 100644
--- a/net/ipv6/mcast.c
+++ b/net/ipv6/mcast.c
@@ -1522,7 +1522,6 @@ static void mld_query_work(struct work_struct *work)
 
 		if (++cnt >= MLD_MAX_QUEUE) {
 			rework = true;
-			schedule_delayed_work(&idev->mc_query_work, 0);
 			break;
 		}
 	}
@@ -1533,8 +1532,10 @@ static void mld_query_work(struct work_struct *work)
 		__mld_query_work(skb);
 	mutex_unlock(&idev->mc_lock);
 
-	if (!rework)
-		in6_dev_put(idev);
+	if (rework && queue_delayed_work(mld_wq, &idev->mc_query_work, 0))
+		return;
+
+	in6_dev_put(idev);
 }
 
 /* called with rcu_read_lock() */
@@ -1624,7 +1625,6 @@ static void mld_report_work(struct work_struct *work)
 
 		if (++cnt >= MLD_MAX_QUEUE) {
 			rework = true;
-			schedule_delayed_work(&idev->mc_report_work, 0);
 			break;
 		}
 	}
@@ -1635,8 +1635,10 @@ static void mld_report_work(struct work_struct *work)
 		__mld_report_work(skb);
 	mutex_unlock(&idev->mc_lock);
 
-	if (!rework)
-		in6_dev_put(idev);
+	if (rework && queue_delayed_work(mld_wq, &idev->mc_report_work, 0))
+		return;
+
+	in6_dev_put(idev);
 }
 
 static bool is_in(struct ifmcaddr6 *pmc, struct ip6_sf_list *psf, int type,
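Read in isolation, the fix makes the worker hand its device reference over to
the requeued work only when queue_delayed_work() actually queued something,
and drop the reference in every other case. A hypothetical, self-contained
module sketching the same hold/put pattern; the demo_* names, the workqueue,
and the run counter are illustrative, not the mcast.c code:

#include <linux/module.h>
#include <linux/workqueue.h>
#include <linux/refcount.h>

static struct workqueue_struct *demo_wq;
static refcount_t demo_ref;
static int runs_left = 3;

static void demo_work_fn(struct work_struct *work);
static DECLARE_DELAYED_WORK(demo_work, demo_work_fn);

static void demo_work_fn(struct work_struct *work)
{
	bool rework = --runs_left > 0;

	/* Keep our reference only if the requeue really happened;
	 * queue_delayed_work() returns false when already pending. */
	if (rework && queue_delayed_work(demo_wq, &demo_work, 0))
		return;

	refcount_dec(&demo_ref);	/* like in6_dev_put() */
	pr_info("demo: ref back to %u\n", refcount_read(&demo_ref));
}

static int __init demo_init(void)
{
	demo_wq = alloc_workqueue("demo_wq", 0, 0);
	if (!demo_wq)
		return -ENOMEM;

	refcount_set(&demo_ref, 1);
	refcount_inc(&demo_ref);	/* like in6_dev_hold() on queue */
	queue_delayed_work(demo_wq, &demo_work, 0);
	return 0;
}

static void __exit demo_exit(void)
{
	cancel_delayed_work_sync(&demo_work);
	destroy_workqueue(demo_wq);
}

module_init(demo_init);
module_exit(demo_exit);
MODULE_LICENSE("GPL");

The same rule appears twice in the diff above: in6_dev_put() now runs whenever
the requeue did not happen, whether because rework is false or because the
work was already pending when the requeue was attempted.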
mld works are supposed to be executed in mld_wq.
But mld_{query | report}_work() calls schedule_delayed_work().
schedule_delayed_work() internally uses system_wq.
So, this would cause the reference count leak.

splat looks like:
unregister_netdevice: waiting for br1 to become free. Usage count = 2
leaked reference.
 ipv6_add_dev+0x3a5/0x1070
 addrconf_notify+0x4f3/0x1760
 notifier_call_chain+0x9e/0x180
 register_netdevice+0xd10/0x11e0
 br_dev_newlink+0x27/0x100 [bridge]
 __rtnl_newlink+0xd85/0x14e0
 rtnl_newlink+0x5f/0x90
 rtnetlink_rcv_msg+0x335/0x9a0
 netlink_rcv_skb+0x121/0x350
 netlink_unicast+0x439/0x710
 netlink_sendmsg+0x75f/0xc00
 ____sys_sendmsg+0x694/0x860
 ___sys_sendmsg+0xe9/0x160
 __sys_sendmsg+0xbe/0x150
 do_syscall_64+0x3b/0x90
 entry_SYSCALL_64_after_hwframe+0x63/0xcd

Fixes: f185de28d9ae ("mld: add new workqueues for process mld events")
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
---
 net/ipv6/mcast.c | 14 ++++++++------
 1 file changed, 8 insertions(+), 6 deletions(-)