mbox series

[net-next,v4,00/10] net: netconsole refactoring and warning fix

Message ID 20240930131214.3771313-1-leitao@debian.org (mailing list archive)
Headers show
Series net: netconsole refactoring and warning fix | expand

Message

Breno Leitao Sept. 30, 2024, 1:11 p.m. UTC
The netconsole driver was showing a warning related to userdata
information, depending on the message size being transmitted:

	------------[ cut here ]------------
	WARNING: CPU: 13 PID: 3013042 at drivers/net/netconsole.c:1122 write_ext_msg+0x3b6/0x3d0
	 ? write_ext_msg+0x3b6/0x3d0
	 console_flush_all+0x1e9/0x330
	 ...

Identifying the cause of this warning proved to be non-trivial due to:

 * The write_ext_msg() function being over 100 lines long
 * Extensive use of pointer arithmetic
 * Inconsistent naming conventions and concept application

The send_ext_msg() function grew organically over time:

 * Initially, the UDP packet consisted of a header and body
 * Later additions included release prepend and userdata
 * Naming became inconsistent (e.g., "body" excludes userdata, "header"
   excludes prepended release)

This lack of consistency made investigating issues like the above warning
more challenging than what it should be.

To address these issues, the following steps were taken:

 * Breaking down write_ext_msg() into smaller functions with clear scopes
 * Improving readability and reasoning about the code
 * Simplifying and clarifying naming conventions

Warning Fix
-----------

The warning occurred when there was insufficient buffer space to append
userdata. While this scenario is acceptable (as userdata can be sent in a
separate packet later), the kernel was incorrectly raising a warning.  A
one-line fix has been implemented to resolve this issue.

A self-test was developed to write messages of every possible length
This test will be submitted in a separate patchset

Changelog:
v4:
 * Pass NULL to userdata in patch 08 ("net: netconsole: do not pass
   userdata up to the tail") (Simon)
 * Do not try to read nt->userdata_length outside
   CONFIG_NETCONSOLE_DYNAMIC in patch 3 ("net: netconsole: separate
   fragmented message handling in send_ext_msg") (Jakub)
 * Improve msgbody_written assignment in patch 6 ("net: netconsole:
   track explicitly if msgbody was written to buffer") (Jakub)

v3:
 * Fix variable definition to an earlier patch (Simon)
   * Same final code.
 * https://lore.kernel.org/all/20240910100410.2690012-1-leitao@debian.org/

v2:
 * Separated the userdata variable move to the tail function into a
   separated fix (Simon)
 * Reformated the patches to fit in 80-lines. Only one not respecting
   this is a copy from previous commit.
 * https://lore.kernel.org/all/20240909130756.2722126-1-leitao@debian.org/

v1:
 * https://lore.kernel.org/all/20240903140757.2802765-1-leitao@debian.org/

Breno Leitao (10):
  net: netconsole: remove msg_ready variable
  net: netconsole: split send_ext_msg_udp() function
  net: netconsole: separate fragmented message handling in send_ext_msg
  net: netconsole: rename body to msg_body
  net: netconsole: introduce variable to track body length
  net: netconsole: track explicitly if msgbody was written to buffer
  net: netconsole: extract release appending into separate function
  net: netconsole: do not pass userdata up to the tail
  net: netconsole: split send_msg_fragmented
  net: netconsole: fix wrong warning

 drivers/net/netconsole.c | 205 ++++++++++++++++++++++++++-------------
 1 file changed, 139 insertions(+), 66 deletions(-)

Comments

Jakub Kicinski Oct. 4, 2024, 12:29 a.m. UTC | #1
On Mon, 30 Sep 2024 06:11:59 -0700 Breno Leitao wrote:
> To address these issues, the following steps were taken:
> 
>  * Breaking down write_ext_msg() into smaller functions with clear scopes
>  * Improving readability and reasoning about the code
>  * Simplifying and clarifying naming conventions
> 
> Warning Fix
> -----------
> 
> The warning occurred when there was insufficient buffer space to append
> userdata. While this scenario is acceptable (as userdata can be sent in a
> separate packet later), the kernel was incorrectly raising a warning.  A
> one-line fix has been implemented to resolve this issue.
> 
> A self-test was developed to write messages of every possible length
> This test will be submitted in a separate patchset

Makes sense in general, but why isn't the fix sent to net first, 
and then once the trees converge (follow Thursday) we can apply 
the refactoring and improvements on top?

The false positive warning went into 6.9 if I'm checking correctly.
Breno Leitao Oct. 4, 2024, 8:50 a.m. UTC | #2
On Thu, Oct 03, 2024 at 05:29:50PM -0700, Jakub Kicinski wrote:
> On Mon, 30 Sep 2024 06:11:59 -0700 Breno Leitao wrote:
> > To address these issues, the following steps were taken:
> > 
> >  * Breaking down write_ext_msg() into smaller functions with clear scopes
> >  * Improving readability and reasoning about the code
> >  * Simplifying and clarifying naming conventions
> > 
> > Warning Fix
> > -----------
> > 
> > The warning occurred when there was insufficient buffer space to append
> > userdata. While this scenario is acceptable (as userdata can be sent in a
> > separate packet later), the kernel was incorrectly raising a warning.  A
> > one-line fix has been implemented to resolve this issue.
> > 
> > A self-test was developed to write messages of every possible length
> > This test will be submitted in a separate patchset
> 
> Makes sense in general, but why isn't the fix sent to net first, 
> and then once the trees converge (follow Thursday) we can apply 
> the refactoring and improvements on top?
> 
> The false positive warning went into 6.9 if I'm checking correctly.

Correct. I probably should have separated the fix from the refactor.

For context, I was pursuing the warning, and the code was hard to read,
so, I was refactoring the code while narrowing down the warning.

But you are correct, the warning is in 6.9+ kernels. But, keep in mind
that the warning is very hard to trigger, basically the length of userdata
and the message needs to be certain size to trigger it.
Jakub Kicinski Oct. 4, 2024, 2:30 p.m. UTC | #3
On Fri, 4 Oct 2024 01:50:13 -0700 Breno Leitao wrote:
> > Makes sense in general, but why isn't the fix sent to net first, 
> > and then once the trees converge (follow Thursday) we can apply 
> > the refactoring and improvements on top?
> > 
> > The false positive warning went into 6.9 if I'm checking correctly.  
> 
> Correct. I probably should have separated the fix from the refactor.
> 
> For context, I was pursuing the warning, and the code was hard to read,
> so, I was refactoring the code while narrowing down the warning.
> 
> But you are correct, the warning is in 6.9+ kernels. But, keep in mind
> that the warning is very hard to trigger, basically the length of userdata
> and the message needs to be certain size to trigger it.

Understood, and to be honest it's a bit of an efficiency thing on
maintainer side - we try to avoid shades of gray as much as possible
because debates on what is and isn't a fix can consume a ton of time.

So in networking we push people to send the fixes for net, even if
triggering the problem isn't very likely.