net: bridge: Fix refcnt issues in dev_ioctl

Message ID 20230819081057.330728-1-astrajoan@yahoo.com (mailing list archive)
State Rejected
Delegated to: Netdev Maintainers
Series net: bridge: Fix refcnt issues in dev_ioctl

Checks

Context Check Description
netdev/series_format warning Single patches do not need cover letters; Target tree name not specified in the subject
netdev/tree_selection success Guessed tree name to be net-next
netdev/fixes_present success Fixes tag not required for -next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 1333 this patch: 1333
netdev/cc_maintainers warning 1 maintainers not CCed: glipus@gmail.com
netdev/build_clang success Errors and warnings before: 1353 this patch: 1353
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success Fixes tag looks correct
netdev/build_allmodconfig_warn success Errors and warnings before: 1356 this patch: 1356
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 44 lines checked
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0

Commit Message

Ziqi Zhao Aug. 19, 2023, 8:10 a.m. UTC
In the bug reported by Syzbot, certain bridge devices would have a
leaked reference created by race conditions in dev_ioctl, specifically
under SIOCBRADDIF or SIOCBRDELIF operations. The reference leak shows
up in the periodic unregister_netdevice check, which throws a warning
and causes Syzbot to report a crash. Upon inspection of the logic in
dev_ioctl, it seems the extra reference was introduced to keep the
bridge device accessible after rtnl_unlock, and that unlock is needed
to maintain the following lock order in any bridge-related ioctl
calls:

1) br_ioctl_mutex => 2) rtnl_lock

Conceptually, though, br_ioctl_mutex could be considered more specific
than rtnl_lock given how the two are used, so swapping their order is a
reasonable change. This patch updates all related call sites to
maintain the reversed order of the two locks:

1) rtnl_lock => 2) br_ioctl_mutex

By doing so, the extra reference introduced in dev_ioctl is no longer
needed, and hence the reference leak bug is now resolved.
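For illustration, the SIOCBRADDIF/SIOCBRDELIF branch of dev_ifsioc()
changes roughly as follows (a simplified sketch of the diff below, with
error handling and unrelated context elided).

Before, with br_ioctl_mutex taken first inside br_ioctl_call() and
rtnl_lock taken again inside br_ioctl_stub():

	case SIOCBRADDIF:
	case SIOCBRDELIF:
		/* dev_ifsioc() already holds rtnl_lock, so it has to drop
		 * it before calling into the bridge code, and the extra
		 * netdev_hold() keeps the device alive across that gap.
		 */
		netdev_hold(dev, &dev_tracker, GFP_KERNEL);
		rtnl_unlock();
		err = br_ioctl_call(net, netdev_priv(dev), cmd, ifr, NULL);
		netdev_put(dev, &dev_tracker);
		rtnl_lock();
		return err;

After, with rtnl_lock taken first and br_ioctl_mutex second, the call
stays under rtnl_lock and the temporary reference disappears:

	case SIOCBRADDIF:
	case SIOCBRDELIF:
		return br_ioctl_call(net, netdev_priv(dev), cmd, ifr, NULL);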

Reported-by: syzbot+881d65229ca4f9ae8c84@syzkaller.appspotmail.com
Fixes: ad2f99aedf8f ("net: bridge: move bridge ioctls out of .ndo_do_ioctl")
Signed-off-by: Ziqi Zhao <astrajoan@yahoo.com>
---
 net/bridge/br_ioctl.c | 4 ----
 net/core/dev_ioctl.c  | 8 +-------
 net/socket.c          | 2 ++
 3 files changed, 3 insertions(+), 11 deletions(-)

Comments

Nikolay Aleksandrov Aug. 19, 2023, 9:25 a.m. UTC | #1
Hi Ziqi,
On 8/19/23 11:10, Ziqi Zhao wrote:
> In the bug reported by Syzbot, certain bridge devices would have a
> leaked reference created by race conditions in dev_ioctl, specifically,
> under SIOCBRADDIF or SIOCBRDELIF operations. The reference leak would

How would it leak a reference, could you elaborate?
The reference is always taken and always released after the call.

> be shown in the periodic unregister_netdevice call, which throws a
> warning and cause Syzbot to report a crash. Upon inspection of the

If you reproduced it, is the device later removed or is it really stuck?

> logic in dev_ioctl, it seems the reference was introduced to ensure
> proper access to the bridge device after rtnl_unlock. and the latter
> function is necessary to maintain the following lock order in any
> bridge related ioctl calls:
> 
> 1) br_ioctl_mutex => 2) rtnl_lock
> 
> Conceptually, though, br_ioctl_mutex could be considered more specific
> than rtnl_lock given their usages, hence swapping their order would be
> a reasonable proposal. This patch changes all related call sites to
> maintain the reversed order of the two locks:
> 
> 1) rtnl_lock => 2) br_ioctl_mutex
> 
> By doing so, the extra reference introduced in dev_ioctl is no longer
> needed, and hence the reference leak bug is now resolved.

IIRC there was no bug, it was a false-positive. The reference is held a 
bit longer but then released, so the device is deleted later.
I might be remembering wrong, but I think I briefly looked into this 
when it was reported. If that's not the case I'd be interested to see
a new report/trace because the bug might be somewhere else.

> 
> Reported-by: syzbot+881d65229ca4f9ae8c84@syzkaller.appspotmail.com
> Fixes: ad2f99aedf8f ("net: bridge: move bridge ioctls out of .ndo_do_ioctl")
> Signed-off-by: Ziqi Zhao <astrajoan@yahoo.com>
> ---
>   net/bridge/br_ioctl.c | 4 ----
>   net/core/dev_ioctl.c  | 8 +-------
>   net/socket.c          | 2 ++
>   3 files changed, 3 insertions(+), 11 deletions(-)
> 

Thanks,
  Nik
Ziqi Zhao Aug. 19, 2023, 10:50 p.m. UTC | #2
On Sat, Aug 19, 2023 at 12:25:15PM +0300, Nikolay Aleksandrov wrote:
Hi Nik,

Thank you so much for reviewing the patch and getting back to me!

> IIRC there was no bug, it was a false-positive. The reference is held a bit
> longer but then released, so the device is deleted later.

> If you reproduced it, is the device later removed or is it really stuck?

I ran the reproducer again without the patch and it seems you are
correct. It was trying to create a very short-lived bridge, then delete
it immediately in the next call. The device in question "wpan4" never
showed up when I polled with `ip link` in the VM, so I'd say it did not
get stuck.

> How would it leak a reference, could you elaborate?
> The reference is always taken and always released after the call.

This was where I got a bit confused too. The system had a timeout of
140 seconds for the unregister_netdevice check. If the bridge in
question was created and deleted repeatedly, the warning would indeed
not be an actual reference leak. But how could its reference show up
after 140 seconds if the bridge's creation and deletion were all within
a couple of milliseconds?

So I let the system run for a bit longer with the reproducer, and after
~200 seconds, the kernel crashed and complained that some tasks had
been waiting for too long (more than 143 seconds) trying to get hold of
the br_ioctl_mutex. This was also quite strange to me, since on the
surface it definitely looked like a deadlock, but the strict locking
order as I described previously should prevent any deadlocks from
happening.

Anyway, I decided to test switching the lock order, since both the
refcnt warning and the task stall seemed closely related to the
above-mentioned locks. When I ran the reproducer again with the patch
applied, both the warning and the stall issue went away. So I guess the
patch is still relevant in preventing bugs in some extreme cases --
although the scenario created by the reproducer would probably never
happen in real usage?

Please let me know whether you have any thoughts on how the above
issues were triggered, and what other information I could gather to
further demystify this bug. Thank you again for your help!

Best regards,
Ziqi
Nikolay Aleksandrov Aug. 22, 2023, 10:40 a.m. UTC | #3
On 8/20/23 01:50, Ziqi Zhao wrote:
> On Sat, Aug 19, 2023 at 12:25:15PM +0300, Nikolay Aleksandrov wrote:
> Hi Nik,
> 
> Thank you so much for reviewing the patch and getting back to me!
> 
>> IIRC there was no bug, it was a false-positive. The reference is held a bit
>> longer but then released, so the device is deleted later.
> 
>> If you reproduced it, is the device later removed or is it really stuck?
> 
> I ran the reproducer again without the patch and it seems you are
> correct. It was trying to create a very short-lived bridge, then delete
> it immediately in the next call. The device in question "wpan4" never
> showed up when I polled with `ip link` in the VM, so I'd say it did not
> get stuck.
> 
>> How would it leak a reference, could you elaborate?
>> The reference is always taken and always released after the call.
> 
> This was where I got a bit confused too. The system had a timeout of
> 140 seconds for the unregister_netdevice check. If the bridge in
> question was created and deleted repeatedly, the warning would indeed
> not be an actual reference leak. But how could its reference show up
> after 140 seconds if the bridge's creation and deletion were all within
> a couple of milliseconds?
> 
> So I let the system run for a bit longer with the reproducer, and after
> ~200 seconds, the kernel crashed and complained that some tasks had
> been waiting for too long (more than 143 seconds) trying to get hold of
> the br_ioctl_mutex. This was also quite strange to me, since on the
> surface it definitely looked like a deadlock, but the strict locking
> order as I described previously should prevent any deadlocks from
> happening.
> 
> Anyways, I decided to test switching up the lock order, since both the
> refcnt warning and the task stall seemed closely related to the above
> mentioned locks. When I ran the reproducer again after the patch, both
> the warning and the stall issue went away. So I guess the patch is
> still relevant in preventing bugs in some extreme cases -- although the
> scenario created by the reproducer would probably never happen in real
> usages?
> 

Thank you for testing, but we really need to understand what is going on 
and why the device isn't getting deleted for so long. Currently I don't 
have the time to debug it properly (I'll be able to next week at the 
earliest). We can't apply the patch based only on tests without 
understanding the underlying issue. I'd look into what
the reproducer is doing exactly and also check the system state while 
the deadlock has happened. Also you can list the currently held locks 
(if CONFIG_LOCKDEP is enabled) via magic sysrq + d for example. See 
which process is holding them, what are their priorities and so on.
Try to build some theory of how a deadlock might happen and then go
about proving it. Does the 8021q module have the same problem? It uses
similar code to set its hook.
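
For reference, the hook dispatch mentioned above lives in net/socket.c:
the bridge module registers br_ioctl_stub() via brioctl_set(), and every
bridge ioctl is funnelled through br_ioctl_call(), which serializes
callers on br_ioctl_mutex. A rough sketch of that path (paraphrased from
net/socket.c around v6.5, not a verbatim copy):

	/* net/socket.c, simplified */
	static DEFINE_MUTEX(br_ioctl_mutex);
	static int (*br_ioctl_hook)(struct net *net, struct net_bridge *br,
				    unsigned int cmd, struct ifreq *ifr,
				    void __user *uarg);

	void brioctl_set(int (*hook)(struct net *, struct net_bridge *,
				     unsigned int, struct ifreq *,
				     void __user *))
	{
		mutex_lock(&br_ioctl_mutex);
		br_ioctl_hook = hook;	/* set by the bridge module on load */
		mutex_unlock(&br_ioctl_mutex);
	}

	int br_ioctl_call(struct net *net, struct net_bridge *br,
			  unsigned int cmd, struct ifreq *ifr,
			  void __user *uarg)
	{
		int err = -ENOPKG;

		if (!br_ioctl_hook)
			request_module("bridge");

		/* every bridge ioctl from sock_ioctl()/dev_ifsioc()
		 * queues up on this mutex before reaching the bridge
		 */
		mutex_lock(&br_ioctl_mutex);
		if (br_ioctl_hook)
			err = br_ioctl_hook(net, br, cmd, ifr, uarg);
		mutex_unlock(&br_ioctl_mutex);

		return err;
	}

The 8021q module follows the same pattern with vlan_ioctl_set() and a
separate vlan_ioctl_mutex, which is why it is worth checking whether it
shows the same behaviour. With CONFIG_LOCKDEP enabled, the locks held at
the moment of the stall can also be dumped with e.g.
echo d > /proc/sysrq-trigger.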

> Please let me know whether you have any thoughts on how the above
> issues were triggered, and what other information I could gather to
> further demystify this bug. Thank you again for your help!
> 
> Best regards,
> Ziqi
Ziqi Zhao Aug. 23, 2023, 9:38 a.m. UTC | #4
On Tue, Aug 22, 2023 at 01:40:45PM +0300, Nikolay Aleksandrov wrote:

> Thank you for testing, but we really need to understand what is going on and
> why the device isn't getting deleted for so long. Currently I don't have the
> time to debug it properly (I'll be able to next week at the earliest). We
> can't apply the patch based only on tests without understanding the
> underlying issue. I'd look into what
> the reproducer is doing exactly and also check the system state while the
> deadlock has happened. Also you can list the currently held locks (if
> CONFIG_LOCKDEP is enabled) via magic sysrq + d for example. See which
> process is holding them, what are their priorities and so on.
> Try to build some theory of how a deadlock might happen and then go
> about proving it. Does the 8021q module have the same problem? It uses
> similar code to set its hook.

Hi Nik,

Thank you so much for the instructions! I was able to obtain a decoded
stacktrace showing the reproducer behavior in my QEMU VM running kernel
6.5-rc4, in case that would give us more context for pinpointing the
problem. Here's a link to the output:

https://pastecat.io/?p=IlKZlflN9j2Z2mspjKe7

Basically, after running the reproducer (line 1854) for about 180
seconds or so, the unregister_netdevice warning was shown (line 1856),
and then after another 50 seconds, the kernel detected that some tasks
had been stalled for more than 143 seconds (line 1866), so it panicked
on the blocked tasks (line 2116). Before the panic, we did get to see
all the locks held in the system (line 2068), and it did show that many
processes created by the reproducer were contending for br_ioctl_mutex.
I'm now starting to wonder whether this is really a deadlock, or simply
some tasks not being able to grab the lock because so many processes
are trying to acquire it.

Let me know what you think about the situation shown in the above log,
and let's keep in touch for any future debugging. Thank you again for
guiding me through the problem!

Best regards,
Ziqi
Alexander Ofitserov Feb. 12, 2024, 3:28 p.m. UTC | #5
On Wed, Aug 23, 2023 at 00:38:46PM +0300, Ziqi Zhao wrote:
> On Tue, Aug 22, 2023 at 01:40:45PM +0300, Nikolay Aleksandrov wrote:
> > Thank you for testing, but we really need to understand what is going on
> > and why the device isn't getting deleted for so long. Currently I don't
> > have the time to debug it properly (I'll be able to next week at the
> > earliest). We can't apply the patch based only on tests without
> > understanding the underlying issue. I'd look into what
> > the reproducer is doing exactly and also check the system state while the
> > deadlock has happened. Also you can list the currently held locks (if
> > CONFIG_LOCKDEP is enabled) via magic sysrq + d for example. See which
> > process is holding them, what are their priorities and so on.
> > Try to build some theory of how a deadlock might happen and then go
> > about proving it. Does the 8021q module have the same problem? It uses
> > similar code to set its hook.
>
> Hi Nik,
>
> Thank you so much for the instructions! I was able to obtain a decoded
> stacktrace showing the reproducer behavior in my QEMU VM running kernel
> 6.5-rc4, in case that would give us more context for pinpointing the
> problem. Here's a link to the output:
>
> https://pastecat.io/?p=IlKZlflN9j2Z2mspjKe7
>
> Basically, after running the reproducer (line 1854) for about 180
> seconnds or so, the unregister_netdevice warning was shown (line 1856),
> and then after another 50 seconds, the kernel detected that some tasks
> have been stalled for more than 143 seconds (line 1866), so it panicked
> on the blocked tasks (line 2116). Before the panic, we did get to see
> all the locks held in the system (line 2068), and it did show that many
> processes created by the reproducer were contending the br_ioctl_mutex.
> I'm now starting to wonder whether this is really a deadlock, or simply
> some tasks not being able to grab the lock because so many processes
> are trying to acquire it.
>
> Let me know what you think about the situation shown in the above log,
> and let's keep in touch for any future debugging. Thank you again for
> guiding me through the problem!
>
> Best regards,
> Ziqi

Hello,

I've also encountered this bug while fuzzing. Is there any ongoing work
on this bug?

Patch

diff --git a/net/bridge/br_ioctl.c b/net/bridge/br_ioctl.c
index f213ed108361..291dbc5d2a99 100644
--- a/net/bridge/br_ioctl.c
+++ b/net/bridge/br_ioctl.c
@@ -399,8 +399,6 @@  int br_ioctl_stub(struct net *net, struct net_bridge *br, unsigned int cmd,
 {
 	int ret = -EOPNOTSUPP;
 
-	rtnl_lock();
-
 	switch (cmd) {
 	case SIOCGIFBR:
 	case SIOCSIFBR:
@@ -434,7 +432,5 @@  int br_ioctl_stub(struct net *net, struct net_bridge *br, unsigned int cmd,
 		break;
 	}
 
-	rtnl_unlock();
-
 	return ret;
 }
diff --git a/net/core/dev_ioctl.c b/net/core/dev_ioctl.c
index 3730945ee294..17df956df8cb 100644
--- a/net/core/dev_ioctl.c
+++ b/net/core/dev_ioctl.c
@@ -336,7 +336,6 @@  static int dev_ifsioc(struct net *net, struct ifreq *ifr, void __user *data,
 	int err;
 	struct net_device *dev = __dev_get_by_name(net, ifr->ifr_name);
 	const struct net_device_ops *ops;
-	netdevice_tracker dev_tracker;
 
 	if (!dev)
 		return -ENODEV;
@@ -405,12 +404,7 @@  static int dev_ifsioc(struct net *net, struct ifreq *ifr, void __user *data,
 			return -ENODEV;
 		if (!netif_is_bridge_master(dev))
 			return -EOPNOTSUPP;
-		netdev_hold(dev, &dev_tracker, GFP_KERNEL);
-		rtnl_unlock();
-		err = br_ioctl_call(net, netdev_priv(dev), cmd, ifr, NULL);
-		netdev_put(dev, &dev_tracker);
-		rtnl_lock();
-		return err;
+		return br_ioctl_call(net, netdev_priv(dev), cmd, ifr, NULL);
 
 	case SIOCDEVPRIVATE ... SIOCDEVPRIVATE + 15:
 		return dev_siocdevprivate(dev, ifr, data, cmd);
diff --git a/net/socket.c b/net/socket.c
index 2b0e54b2405c..6b7a9df9a326 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -1258,7 +1258,9 @@  static long sock_ioctl(struct file *file, unsigned cmd, unsigned long arg)
 		case SIOCSIFBR:
 		case SIOCBRADDBR:
 		case SIOCBRDELBR:
+			rtnl_lock();
 			err = br_ioctl_call(net, NULL, cmd, NULL, argp);
+			rtnl_unlock();
 			break;
 		case SIOCGIFVLAN:
 		case SIOCSIFVLAN: