diff mbox series

ftrace: make extra rcu_is_watching() validation check optional

Message ID 20240322160323.2463329-1-andrii@kernel.org (mailing list archive)
State Superseded
Headers show
Series ftrace: make extra rcu_is_watching() validation check optional | expand

Checks

Context Check Description
netdev/tree_selection success Not a local patch

Commit Message

Andrii Nakryiko March 22, 2024, 4:03 p.m. UTC
Introduce CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING config option to
control whether ftrace low-level code performs additional
rcu_is_watching()-based validation logic in an attempt to catch noinstr
violations.

This check is expected to never be true in practice and would be best
controlled with extra config to let users decide if they are willing to
pay the price.

Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
---
 include/linux/trace_recursion.h |  2 +-
 kernel/trace/Kconfig            | 13 +++++++++++++
 2 files changed, 14 insertions(+), 1 deletion(-)

Comments

Masami Hiramatsu (Google) March 25, 2024, 2:38 a.m. UTC | #1
On Fri, 22 Mar 2024 09:03:23 -0700
Andrii Nakryiko <andrii@kernel.org> wrote:

> Introduce CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING config option to
> control whether ftrace low-level code performs additional
> rcu_is_watching()-based validation logic in an attempt to catch noinstr
> violations.
> 
> This check is expected to never be true in practice and would be best
> controlled with extra config to let users decide if they are willing to
> pay the price.

Hmm, for me, it sounds like "WARN_ON(something) never be true in practice
so disable it by default". I think CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING
is OK, but tht should be set to Y by default. If you have already verified
that your system never make it true and you want to optimize your ftrace
path, you can manually set it to N at your own risk.

> 
> Cc: Steven Rostedt <rostedt@goodmis.org>
> Cc: Masami Hiramatsu <mhiramat@kernel.org>
> Cc: Paul E. McKenney <paulmck@kernel.org>
> Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
> ---
>  include/linux/trace_recursion.h |  2 +-
>  kernel/trace/Kconfig            | 13 +++++++++++++
>  2 files changed, 14 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/trace_recursion.h b/include/linux/trace_recursion.h
> index d48cd92d2364..24ea8ac049b4 100644
> --- a/include/linux/trace_recursion.h
> +++ b/include/linux/trace_recursion.h
> @@ -135,7 +135,7 @@ extern void ftrace_record_recursion(unsigned long ip, unsigned long parent_ip);
>  # define do_ftrace_record_recursion(ip, pip)	do { } while (0)
>  #endif
>  
> -#ifdef CONFIG_ARCH_WANTS_NO_INSTR
> +#ifdef CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING
>  # define trace_warn_on_no_rcu(ip)					\
>  	({								\
>  		bool __ret = !rcu_is_watching();			\

BTW, maybe we can add "unlikely" in the next "if" line?

> diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
> index 61c541c36596..19bce4e217d6 100644
> --- a/kernel/trace/Kconfig
> +++ b/kernel/trace/Kconfig
> @@ -974,6 +974,19 @@ config FTRACE_RECORD_RECURSION_SIZE
>  	  This file can be reset, but the limit can not change in
>  	  size at runtime.
>  
> +config FTRACE_VALIDATE_RCU_IS_WATCHING
> +	bool "Validate RCU is on during ftrace recursion check"
> +	depends on FUNCTION_TRACER
> +	depends on ARCH_WANTS_NO_INSTR

	default y

> +	help
> +	  All callbacks that attach to the function tracing have some sort
> +	  of protection against recursion. This option performs additional
> +	  checks to make sure RCU is on when ftrace callbacks recurse.
> +
> +	  This will add more overhead to all ftrace-based invocations.

	... invocations, but keep it safe.

> +
> +	  If unsure, say N

	If unsure, say Y

Thank you,

> +
>  config RING_BUFFER_RECORD_RECURSION
>  	bool "Record functions that recurse in the ring buffer"
>  	depends on FTRACE_RECORD_RECURSION
> -- 
> 2.43.0
>
Andrii Nakryiko March 25, 2024, 4:56 p.m. UTC | #2
On Sun, Mar 24, 2024 at 7:38 PM Masami Hiramatsu <mhiramat@kernel.org> wrote:
>
> On Fri, 22 Mar 2024 09:03:23 -0700
> Andrii Nakryiko <andrii@kernel.org> wrote:
>
> > Introduce CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING config option to
> > control whether ftrace low-level code performs additional
> > rcu_is_watching()-based validation logic in an attempt to catch noinstr
> > violations.
> >
> > This check is expected to never be true in practice and would be best
> > controlled with extra config to let users decide if they are willing to
> > pay the price.
>
> Hmm, for me, it sounds like "WARN_ON(something) never be true in practice
> so disable it by default". I think CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING
> is OK, but tht should be set to Y by default. If you have already verified
> that your system never make it true and you want to optimize your ftrace
> path, you can manually set it to N at your own risk.

Yeah, I don't think we ever see this warning across our machines. And
sure, I can default it to Y, no problem.

>
> >
> > Cc: Steven Rostedt <rostedt@goodmis.org>
> > Cc: Masami Hiramatsu <mhiramat@kernel.org>
> > Cc: Paul E. McKenney <paulmck@kernel.org>
> > Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
> > ---
> >  include/linux/trace_recursion.h |  2 +-
> >  kernel/trace/Kconfig            | 13 +++++++++++++
> >  2 files changed, 14 insertions(+), 1 deletion(-)
> >
> > diff --git a/include/linux/trace_recursion.h b/include/linux/trace_recursion.h
> > index d48cd92d2364..24ea8ac049b4 100644
> > --- a/include/linux/trace_recursion.h
> > +++ b/include/linux/trace_recursion.h
> > @@ -135,7 +135,7 @@ extern void ftrace_record_recursion(unsigned long ip, unsigned long parent_ip);
> >  # define do_ftrace_record_recursion(ip, pip) do { } while (0)
> >  #endif
> >
> > -#ifdef CONFIG_ARCH_WANTS_NO_INSTR
> > +#ifdef CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING
> >  # define trace_warn_on_no_rcu(ip)                                    \
> >       ({                                                              \
> >               bool __ret = !rcu_is_watching();                        \
>
> BTW, maybe we can add "unlikely" in the next "if" line?

sure, can add that as well

>
> > diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
> > index 61c541c36596..19bce4e217d6 100644
> > --- a/kernel/trace/Kconfig
> > +++ b/kernel/trace/Kconfig
> > @@ -974,6 +974,19 @@ config FTRACE_RECORD_RECURSION_SIZE
> >         This file can be reset, but the limit can not change in
> >         size at runtime.
> >
> > +config FTRACE_VALIDATE_RCU_IS_WATCHING
> > +     bool "Validate RCU is on during ftrace recursion check"
> > +     depends on FUNCTION_TRACER
> > +     depends on ARCH_WANTS_NO_INSTR
>
>         default y
>

ok

> > +     help
> > +       All callbacks that attach to the function tracing have some sort
> > +       of protection against recursion. This option performs additional
> > +       checks to make sure RCU is on when ftrace callbacks recurse.
> > +
> > +       This will add more overhead to all ftrace-based invocations.
>
>         ... invocations, but keep it safe.
>
> > +
> > +       If unsure, say N
>
>         If unsure, say Y
>

yep, will do, thanks!

> Thank you,
>
> > +
> >  config RING_BUFFER_RECORD_RECURSION
> >       bool "Record functions that recurse in the ring buffer"
> >       depends on FTRACE_RECORD_RECURSION
> > --
> > 2.43.0
> >
>
>
> --
> Masami Hiramatsu (Google) <mhiramat@kernel.org>
Steven Rostedt March 25, 2024, 10:13 p.m. UTC | #3
On Mon, 25 Mar 2024 11:38:48 +0900
Masami Hiramatsu (Google) <mhiramat@kernel.org> wrote:

> On Fri, 22 Mar 2024 09:03:23 -0700
> Andrii Nakryiko <andrii@kernel.org> wrote:
> 
> > Introduce CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING config option to
> > control whether ftrace low-level code performs additional
> > rcu_is_watching()-based validation logic in an attempt to catch noinstr
> > violations.
> > 
> > This check is expected to never be true in practice and would be best
> > controlled with extra config to let users decide if they are willing to
> > pay the price.  
> 
> Hmm, for me, it sounds like "WARN_ON(something) never be true in practice
> so disable it by default". I think CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING
> is OK, but tht should be set to Y by default. If you have already verified
> that your system never make it true and you want to optimize your ftrace
> path, you can manually set it to N at your own risk.
> 

Really, it's for debugging. I would argue that it should *not* be default y.
Peter added this to find all the locations that could be called where RCU
is not watching. But the issue I have is that this is that it *does cause
overhead* with function tracing.

I believe we found pretty much all locations that were an issue, and we
should now just make it an option for developers.

It's no different than lockdep. Test boxes should have it enabled, but
there's no reason to have this enabled in a production system.

-- Steve


> > 
> > Cc: Steven Rostedt <rostedt@goodmis.org>
> > Cc: Masami Hiramatsu <mhiramat@kernel.org>
> > Cc: Paul E. McKenney <paulmck@kernel.org>
> > Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
> > ---
> >  include/linux/trace_recursion.h |  2 +-
> >  kernel/trace/Kconfig            | 13 +++++++++++++
> >  2 files changed, 14 insertions(+), 1 deletion(-)
> > 
> > diff --git a/include/linux/trace_recursion.h b/include/linux/trace_recursion.h
> > index d48cd92d2364..24ea8ac049b4 100644
> > --- a/include/linux/trace_recursion.h
> > +++ b/include/linux/trace_recursion.h
> > @@ -135,7 +135,7 @@ extern void ftrace_record_recursion(unsigned long ip, unsigned long parent_ip);
> >  # define do_ftrace_record_recursion(ip, pip)	do { } while (0)
> >  #endif
> >  
> > -#ifdef CONFIG_ARCH_WANTS_NO_INSTR
> > +#ifdef CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING
> >  # define trace_warn_on_no_rcu(ip)					\
> >  	({								\
> >  		bool __ret = !rcu_is_watching();			\  
> 
> BTW, maybe we can add "unlikely" in the next "if" line?
> 
> > diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
> > index 61c541c36596..19bce4e217d6 100644
> > --- a/kernel/trace/Kconfig
> > +++ b/kernel/trace/Kconfig
> > @@ -974,6 +974,19 @@ config FTRACE_RECORD_RECURSION_SIZE
> >  	  This file can be reset, but the limit can not change in
> >  	  size at runtime.
> >  
> > +config FTRACE_VALIDATE_RCU_IS_WATCHING
> > +	bool "Validate RCU is on during ftrace recursion check"
> > +	depends on FUNCTION_TRACER
> > +	depends on ARCH_WANTS_NO_INSTR  
> 
> 	default y
> 
> > +	help
> > +	  All callbacks that attach to the function tracing have some sort
> > +	  of protection against recursion. This option performs additional
> > +	  checks to make sure RCU is on when ftrace callbacks recurse.
> > +
> > +	  This will add more overhead to all ftrace-based invocations.  
> 
> 	... invocations, but keep it safe.
> 
> > +
> > +	  If unsure, say N  
> 
> 	If unsure, say Y
> 
> Thank you,
> 
> > +
> >  config RING_BUFFER_RECORD_RECURSION
> >  	bool "Record functions that recurse in the ring buffer"
> >  	depends on FTRACE_RECORD_RECURSION
> > -- 
> > 2.43.0
> >   
> 
>
Andrii Nakryiko March 26, 2024, 4:16 p.m. UTC | #4
On Mon, Mar 25, 2024 at 3:11 PM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> On Mon, 25 Mar 2024 11:38:48 +0900
> Masami Hiramatsu (Google) <mhiramat@kernel.org> wrote:
>
> > On Fri, 22 Mar 2024 09:03:23 -0700
> > Andrii Nakryiko <andrii@kernel.org> wrote:
> >
> > > Introduce CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING config option to
> > > control whether ftrace low-level code performs additional
> > > rcu_is_watching()-based validation logic in an attempt to catch noinstr
> > > violations.
> > >
> > > This check is expected to never be true in practice and would be best
> > > controlled with extra config to let users decide if they are willing to
> > > pay the price.
> >
> > Hmm, for me, it sounds like "WARN_ON(something) never be true in practice
> > so disable it by default". I think CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING
> > is OK, but tht should be set to Y by default. If you have already verified
> > that your system never make it true and you want to optimize your ftrace
> > path, you can manually set it to N at your own risk.
> >
>
> Really, it's for debugging. I would argue that it should *not* be default y.
> Peter added this to find all the locations that could be called where RCU
> is not watching. But the issue I have is that this is that it *does cause
> overhead* with function tracing.
>
> I believe we found pretty much all locations that were an issue, and we
> should now just make it an option for developers.
>
> It's no different than lockdep. Test boxes should have it enabled, but
> there's no reason to have this enabled in a production system.
>

I tend to agree with Steven here (which is why I sent this patch as it
is), but I'm happy to do it as an opt-out, if Masami insists. Please
do let me know if I need to send v2 or this one is actually the one
we'll end up using. Thanks!

> -- Steve
>
>
> > >
> > > Cc: Steven Rostedt <rostedt@goodmis.org>
> > > Cc: Masami Hiramatsu <mhiramat@kernel.org>
> > > Cc: Paul E. McKenney <paulmck@kernel.org>
> > > Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
> > > ---
> > >  include/linux/trace_recursion.h |  2 +-
> > >  kernel/trace/Kconfig            | 13 +++++++++++++
> > >  2 files changed, 14 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/include/linux/trace_recursion.h b/include/linux/trace_recursion.h
> > > index d48cd92d2364..24ea8ac049b4 100644
> > > --- a/include/linux/trace_recursion.h
> > > +++ b/include/linux/trace_recursion.h
> > > @@ -135,7 +135,7 @@ extern void ftrace_record_recursion(unsigned long ip, unsigned long parent_ip);
> > >  # define do_ftrace_record_recursion(ip, pip)       do { } while (0)
> > >  #endif
> > >
> > > -#ifdef CONFIG_ARCH_WANTS_NO_INSTR
> > > +#ifdef CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING
> > >  # define trace_warn_on_no_rcu(ip)                                  \
> > >     ({                                                              \
> > >             bool __ret = !rcu_is_watching();                        \
> >
> > BTW, maybe we can add "unlikely" in the next "if" line?
> >
> > > diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
> > > index 61c541c36596..19bce4e217d6 100644
> > > --- a/kernel/trace/Kconfig
> > > +++ b/kernel/trace/Kconfig
> > > @@ -974,6 +974,19 @@ config FTRACE_RECORD_RECURSION_SIZE
> > >       This file can be reset, but the limit can not change in
> > >       size at runtime.
> > >
> > > +config FTRACE_VALIDATE_RCU_IS_WATCHING
> > > +   bool "Validate RCU is on during ftrace recursion check"
> > > +   depends on FUNCTION_TRACER
> > > +   depends on ARCH_WANTS_NO_INSTR
> >
> >       default y
> >
> > > +   help
> > > +     All callbacks that attach to the function tracing have some sort
> > > +     of protection against recursion. This option performs additional
> > > +     checks to make sure RCU is on when ftrace callbacks recurse.
> > > +
> > > +     This will add more overhead to all ftrace-based invocations.
> >
> >       ... invocations, but keep it safe.
> >
> > > +
> > > +     If unsure, say N
> >
> >       If unsure, say Y
> >
> > Thank you,
> >
> > > +
> > >  config RING_BUFFER_RECORD_RECURSION
> > >     bool "Record functions that recurse in the ring buffer"
> > >     depends on FTRACE_RECORD_RECURSION
> > > --
> > > 2.43.0
> > >
> >
> >
>
Steven Rostedt March 26, 2024, 7:01 p.m. UTC | #5
On Tue, 26 Mar 2024 09:16:33 -0700
Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote:

> > It's no different than lockdep. Test boxes should have it enabled, but
> > there's no reason to have this enabled in a production system.
> >  
> 
> I tend to agree with Steven here (which is why I sent this patch as it
> is), but I'm happy to do it as an opt-out, if Masami insists. Please
> do let me know if I need to send v2 or this one is actually the one
> we'll end up using. Thanks!

Masami,

Are you OK with just keeping it set to N.

We could have other options like PROVE_LOCKING enable it.

-- Steve
Andrii Nakryiko March 29, 2024, 4:39 p.m. UTC | #6
On Tue, Mar 26, 2024 at 11:58 AM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> On Tue, 26 Mar 2024 09:16:33 -0700
> Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote:
>
> > > It's no different than lockdep. Test boxes should have it enabled, but
> > > there's no reason to have this enabled in a production system.
> > >
> >
> > I tend to agree with Steven here (which is why I sent this patch as it
> > is), but I'm happy to do it as an opt-out, if Masami insists. Please
> > do let me know if I need to send v2 or this one is actually the one
> > we'll end up using. Thanks!
>
> Masami,
>
> Are you OK with just keeping it set to N.
>
> We could have other options like PROVE_LOCKING enable it.
>

So what's the conclusion, Masami? Should I send another version where
this config is opt-out, or are you ok with keeping it as opt-in as
proposed in this revision?

> -- Steve
Masami Hiramatsu (Google) April 1, 2024, 11:25 a.m. UTC | #7
On Tue, 26 Mar 2024 15:01:21 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> On Tue, 26 Mar 2024 09:16:33 -0700
> Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote:
> 
> > > It's no different than lockdep. Test boxes should have it enabled, but
> > > there's no reason to have this enabled in a production system.
> > >  
> > 
> > I tend to agree with Steven here (which is why I sent this patch as it
> > is), but I'm happy to do it as an opt-out, if Masami insists. Please
> > do let me know if I need to send v2 or this one is actually the one
> > we'll end up using. Thanks!
> 
> Masami,
> 
> Are you OK with just keeping it set to N.

OK, if it is only for the debugging, I'm OK to set N this.

> 
> We could have other options like PROVE_LOCKING enable it.

Agreed (but it should say this is a debug option)

Thank you,

> 
> -- Steve
Steven Rostedt April 1, 2024, 4:09 p.m. UTC | #8
On Mon, 1 Apr 2024 20:25:52 +0900
Masami Hiramatsu (Google) <mhiramat@kernel.org> wrote:

> > Masami,
> > 
> > Are you OK with just keeping it set to N.  
> 
> OK, if it is only for the debugging, I'm OK to set N this.
> 
> > 
> > We could have other options like PROVE_LOCKING enable it.  
> 
> Agreed (but it should say this is a debug option)

It does say "Validate" which to me is a debug option. What would you
suggest?

-- Steve
Masami Hiramatsu (Google) April 2, 2024, 12:38 a.m. UTC | #9
On Mon, 1 Apr 2024 12:09:18 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> On Mon, 1 Apr 2024 20:25:52 +0900
> Masami Hiramatsu (Google) <mhiramat@kernel.org> wrote:
> 
> > > Masami,
> > > 
> > > Are you OK with just keeping it set to N.  
> > 
> > OK, if it is only for the debugging, I'm OK to set N this.
> > 
> > > 
> > > We could have other options like PROVE_LOCKING enable it.  
> > 
> > Agreed (but it should say this is a debug option)
> 
> It does say "Validate" which to me is a debug option. What would you
> suggest?

I think the help message should have "This is for debugging ftrace."

Thank you,

> 
> -- Steve
Andrii Nakryiko April 2, 2024, 2:29 a.m. UTC | #10
On Mon, Apr 1, 2024 at 5:38 PM Masami Hiramatsu <mhiramat@kernel.org> wrote:
>
> On Mon, 1 Apr 2024 12:09:18 -0400
> Steven Rostedt <rostedt@goodmis.org> wrote:
>
> > On Mon, 1 Apr 2024 20:25:52 +0900
> > Masami Hiramatsu (Google) <mhiramat@kernel.org> wrote:
> >
> > > > Masami,
> > > >
> > > > Are you OK with just keeping it set to N.
> > >
> > > OK, if it is only for the debugging, I'm OK to set N this.
> > >
> > > >
> > > > We could have other options like PROVE_LOCKING enable it.
> > >
> > > Agreed (but it should say this is a debug option)
> >
> > It does say "Validate" which to me is a debug option. What would you
> > suggest?
>
> I think the help message should have "This is for debugging ftrace."
>

Sent v2 with adjusted wording, thanks!

> Thank you,
>
> >
> > -- Steve
>
>
> --
> Masami Hiramatsu (Google) <mhiramat@kernel.org>
Steven Rostedt April 2, 2024, 2:47 a.m. UTC | #11
On Mon, 1 Apr 2024 19:29:46 -0700
Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote:

> On Mon, Apr 1, 2024 at 5:38 PM Masami Hiramatsu <mhiramat@kernel.org> wrote:
> >
> > On Mon, 1 Apr 2024 12:09:18 -0400
> > Steven Rostedt <rostedt@goodmis.org> wrote:
> >  
> > > On Mon, 1 Apr 2024 20:25:52 +0900
> > > Masami Hiramatsu (Google) <mhiramat@kernel.org> wrote:
> > >  
> > > > > Masami,
> > > > >
> > > > > Are you OK with just keeping it set to N.  
> > > >
> > > > OK, if it is only for the debugging, I'm OK to set N this.
> > > >  
> > > > >
> > > > > We could have other options like PROVE_LOCKING enable it.  
> > > >
> > > > Agreed (but it should say this is a debug option)  
> > >
> > > It does say "Validate" which to me is a debug option. What would you
> > > suggest?  
> >
> > I think the help message should have "This is for debugging ftrace."
> >  
> 
> Sent v2 with adjusted wording, thanks!

You may want to wait till Masami and I agree ;-)

Masami,

But it isn't really for "debugging", it's for validating. That is, it
doesn't give us any information to debug ftrace. It only validates if it is
executed properly. In other words, I never want to be asked "How can I use
this option to debug ftrace?"

For example, we also have:

  "Verify ring buffer time stamp deltas"

that makes sure the time stamps of the ring buffer are not buggy.

And there's:

  "Verify compile time sorting of ftrace functions"

Which is also used to make sure things are working properly.

Neither of the above says they are for "debugging", even though they are
more useful for debugging than this option.

I'm not sure saying this is "debugging ftrace" is accurate. It may help
debug ftrace if it is called outside of an RCU location, which has a
1 in 100,000,000,000 chance of causing an actual bug, as the race window is
extremely small. 

Now if it is also called outside of instrumentation, that will likely trigger
other warnings even without this code, and this will not be needed to debug
that.

ftrace has all sorts of "verifiers" that is used to make sure things are
working properly. And yes, you can consider it as "debugging". But I would
not consider this an option to enable if ftrace was broken, and you are
looking into why it is broken.

To me, this option is only to verify that ftrace (and other users of
ftrace_test_recursion_trylock()) are not called outside of RCU, as if they
are, it can cause a race. But it also has a noticeable overhead when enabled.

-- Steve


-- Steve
Masami Hiramatsu (Google) April 3, 2024, 12:40 a.m. UTC | #12
On Mon, 1 Apr 2024 22:47:33 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> On Mon, 1 Apr 2024 19:29:46 -0700
> Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote:
> 
> > On Mon, Apr 1, 2024 at 5:38 PM Masami Hiramatsu <mhiramat@kernel.org> wrote:
> > >
> > > On Mon, 1 Apr 2024 12:09:18 -0400
> > > Steven Rostedt <rostedt@goodmis.org> wrote:
> > >  
> > > > On Mon, 1 Apr 2024 20:25:52 +0900
> > > > Masami Hiramatsu (Google) <mhiramat@kernel.org> wrote:
> > > >  
> > > > > > Masami,
> > > > > >
> > > > > > Are you OK with just keeping it set to N.  
> > > > >
> > > > > OK, if it is only for the debugging, I'm OK to set N this.
> > > > >  
> > > > > >
> > > > > > We could have other options like PROVE_LOCKING enable it.  
> > > > >
> > > > > Agreed (but it should say this is a debug option)  
> > > >
> > > > It does say "Validate" which to me is a debug option. What would you
> > > > suggest?  
> > >
> > > I think the help message should have "This is for debugging ftrace."
> > >  
> > 
> > Sent v2 with adjusted wording, thanks!
> 
> You may want to wait till Masami and I agree ;-)
> 
> Masami,
> 
> But it isn't really for "debugging", it's for validating. That is, it
> doesn't give us any information to debug ftrace. It only validates if it is
> executed properly. In other words, I never want to be asked "How can I use
> this option to debug ftrace?"
> 
> For example, we also have:
> 
>   "Verify ring buffer time stamp deltas"
> 
> that makes sure the time stamps of the ring buffer are not buggy.
> 
> And there's:
> 
>   "Verify compile time sorting of ftrace functions"
> 
> Which is also used to make sure things are working properly.
> 
> Neither of the above says they are for "debugging", even though they are
> more useful for debugging than this option.
> 
> I'm not sure saying this is "debugging ftrace" is accurate. It may help
> debug ftrace if it is called outside of an RCU location, which has a
> 1 in 100,000,000,000 chance of causing an actual bug, as the race window is
> extremely small. 
> 
> Now if it is also called outside of instrumentation, that will likely trigger
> other warnings even without this code, and this will not be needed to debug
> that.
> 
> ftrace has all sorts of "verifiers" that is used to make sure things are
> working properly. And yes, you can consider it as "debugging". But I would
> not consider this an option to enable if ftrace was broken, and you are
> looking into why it is broken.
> 
> To me, this option is only to verify that ftrace (and other users of
> ftrace_test_recursion_trylock()) are not called outside of RCU, as if they
> are, it can cause a race. But it also has a noticeable overhead when enabled.

OK, for me, this last sentence is preferred for the help message. That explains
what this is for. 

	  All callbacks that attach to the function tracing have some sort
	  of protection against recursion. This option is only to verify that
	 ftrace (and other users of ftrace_test_recursion_trylock()) are not
	  called outside of RCU, as if they are, it can cause a race.
	  But it also has a noticeable overhead when enabled.

BTW, how much overhead does this introduce? and the race case a kernel crash?
or just messed up the ftrace buffer?

Thank you,

> 
> -- Steve
> 
> 
> -- Steve
Steven Rostedt April 3, 2024, 12:54 a.m. UTC | #13
On Wed, 3 Apr 2024 09:40:48 +0900
Masami Hiramatsu (Google) <mhiramat@kernel.org> wrote:

> OK, for me, this last sentence is preferred for the help message. That explains
> what this is for. 
> 
> 	  All callbacks that attach to the function tracing have some sort
> 	  of protection against recursion. This option is only to verify that
> 	 ftrace (and other users of ftrace_test_recursion_trylock()) are not
> 	  called outside of RCU, as if they are, it can cause a race.
> 	  But it also has a noticeable overhead when enabled.
> 
> BTW, how much overhead does this introduce? and the race case a kernel crash?
> or just messed up the ftrace buffer?

There's a hypothetical race where it can cause a use after free.

That is, after you shutdown ftrace, you need to call synchronize_rcu_tasks(),
which requires RCU to be watching. There's a theoretical case where that
task calls the trampoline and misses the synchronization. Note, these
locations are with preemption disabled, as rcu is always watching when
preemption is enabled. Thus it would be extremely fast where as the
synchronize_rcu_tasks() is rather slow.

We also have synchronize_rcu_tasks_rude() which would actually keep the
trace from happening, as it would schedule on each CPU forcing all CPUs to
have RCU watching.

I have never heard of this race being hit. I guess it could happen on a VM
where a vCPU gets preempted at the right moment for a long time and the
other CPUs synchronize.

But like lockdep, where deadlocks can crash the kernel, we don't enable it
for production.

The overhead is another function call within the function tracer. I had
numbers before, but I guess I could run tests again and get new numbers.

Thanks,

-- Steve
Andrii Nakryiko April 3, 2024, 4 a.m. UTC | #14
On Tue, Apr 2, 2024 at 5:52 PM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> On Wed, 3 Apr 2024 09:40:48 +0900
> Masami Hiramatsu (Google) <mhiramat@kernel.org> wrote:
>
> > OK, for me, this last sentence is preferred for the help message. That explains
> > what this is for.
> >
> >         All callbacks that attach to the function tracing have some sort
> >         of protection against recursion. This option is only to verify that
> >        ftrace (and other users of ftrace_test_recursion_trylock()) are not
> >         called outside of RCU, as if they are, it can cause a race.
> >         But it also has a noticeable overhead when enabled.

Sounds good to me, I can add this to the description of the Kconfig option.

> >
> > BTW, how much overhead does this introduce? and the race case a kernel crash?

I just checked our fleet-wide production data for the last 24 hours.
Within the kprobe/kretprobe code path (ftrace_trampoline and
everything called from it), rcu_is_watching (both calls, see below)
cause 0.484% CPU cycles usage, which isn't nothing. So definitely we'd
prefer to be able to avoid that in production use cases.

> > or just messed up the ftrace buffer?
>
> There's a hypothetical race where it can cause a use after free.
>
> That is, after you shutdown ftrace, you need to call synchronize_rcu_tasks(),
> which requires RCU to be watching. There's a theoretical case where that
> task calls the trampoline and misses the synchronization. Note, these
> locations are with preemption disabled, as rcu is always watching when
> preemption is enabled. Thus it would be extremely fast where as the
> synchronize_rcu_tasks() is rather slow.
>
> We also have synchronize_rcu_tasks_rude() which would actually keep the
> trace from happening, as it would schedule on each CPU forcing all CPUs to
> have RCU watching.
>
> I have never heard of this race being hit. I guess it could happen on a VM
> where a vCPU gets preempted at the right moment for a long time and the
> other CPUs synchronize.
>
> But like lockdep, where deadlocks can crash the kernel, we don't enable it
> for production.
>
> The overhead is another function call within the function tracer. I had
> numbers before, but I guess I could run tests again and get new numbers.
>

I just noticed another rcu_is_watching() call, in rethook_try_get(),
which seems to be a similar and complementary validation check to the
one we are putting under CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING option
in this patch. It feels like both of them should be controlled by the
same settings. WDYT? Can I add CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING
guard around rcu_is_watching() check in rethook_try_get() as well?


> Thanks,
>
> -- Steve
Andrii Nakryiko April 3, 2024, 5:21 a.m. UTC | #15
On Tue, Apr 2, 2024 at 9:00 PM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> On Tue, Apr 2, 2024 at 5:52 PM Steven Rostedt <rostedt@goodmis.org> wrote:
> >
> > On Wed, 3 Apr 2024 09:40:48 +0900
> > Masami Hiramatsu (Google) <mhiramat@kernel.org> wrote:
> >
> > > OK, for me, this last sentence is preferred for the help message. That explains
> > > what this is for.
> > >
> > >         All callbacks that attach to the function tracing have some sort
> > >         of protection against recursion. This option is only to verify that
> > >        ftrace (and other users of ftrace_test_recursion_trylock()) are not
> > >         called outside of RCU, as if they are, it can cause a race.
> > >         But it also has a noticeable overhead when enabled.
>
> Sounds good to me, I can add this to the description of the Kconfig option.
>
> > >
> > > BTW, how much overhead does this introduce? and the race case a kernel crash?
>
> I just checked our fleet-wide production data for the last 24 hours.
> Within the kprobe/kretprobe code path (ftrace_trampoline and
> everything called from it), rcu_is_watching (both calls, see below)
> cause 0.484% CPU cycles usage, which isn't nothing. So definitely we'd
> prefer to be able to avoid that in production use cases.
>

I just ran synthetic microbenchmark testing multi-kretprobe
throughput. We get (in millions of BPF kretprobe-multi program
invocations per second):
  - 5.568M/s as baseline;
  - 5.679M/s with changes in this patch (+2% throughput improvement);
  - 5.808M/s with disabling rcu_is_watching in rethook_try_get()
(+2.3% more vs just one of rcu_is_watching, and +4.3% vs baseline).

It's definitely noticeable.

> > > or just messed up the ftrace buffer?
> >
> > There's a hypothetical race where it can cause a use after free.
> >
> > That is, after you shutdown ftrace, you need to call synchronize_rcu_tasks(),
> > which requires RCU to be watching. There's a theoretical case where that
> > task calls the trampoline and misses the synchronization. Note, these
> > locations are with preemption disabled, as rcu is always watching when
> > preemption is enabled. Thus it would be extremely fast where as the
> > synchronize_rcu_tasks() is rather slow.
> >
> > We also have synchronize_rcu_tasks_rude() which would actually keep the
> > trace from happening, as it would schedule on each CPU forcing all CPUs to
> > have RCU watching.
> >
> > I have never heard of this race being hit. I guess it could happen on a VM
> > where a vCPU gets preempted at the right moment for a long time and the
> > other CPUs synchronize.
> >
> > But like lockdep, where deadlocks can crash the kernel, we don't enable it
> > for production.
> >
> > The overhead is another function call within the function tracer. I had
> > numbers before, but I guess I could run tests again and get new numbers.
> >
>
> I just noticed another rcu_is_watching() call, in rethook_try_get(),
> which seems to be a similar and complementary validation check to the
> one we are putting under CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING option
> in this patch. It feels like both of them should be controlled by the
> same settings. WDYT? Can I add CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING
> guard around rcu_is_watching() check in rethook_try_get() as well?
>
>
> > Thanks,
> >
> > -- Steve
Steven Rostedt April 3, 2024, 1:53 p.m. UTC | #16
On Tue, 2 Apr 2024 21:00:19 -0700
Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote:

> I just noticed another rcu_is_watching() call, in rethook_try_get(),
> which seems to be a similar and complementary validation check to the
> one we are putting under CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING option
> in this patch. It feels like both of them should be controlled by the
> same settings. WDYT? Can I add CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING
> guard around rcu_is_watching() check in rethook_try_get() as well?

That is totally up to Masami. It may have even less overhead as I'm not
sure how many times that gets called, and there may be more work to do than
with function tracing.

-- Steve
Steven Rostedt April 3, 2024, 1:55 p.m. UTC | #17
On Tue, 2 Apr 2024 22:21:00 -0700
Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote:

> > I just checked our fleet-wide production data for the last 24 hours.
> > Within the kprobe/kretprobe code path (ftrace_trampoline and
> > everything called from it), rcu_is_watching (both calls, see below)
> > cause 0.484% CPU cycles usage, which isn't nothing. So definitely we'd
> > prefer to be able to avoid that in production use cases.
> >  
> 
> I just ran synthetic microbenchmark testing multi-kretprobe
> throughput. We get (in millions of BPF kretprobe-multi program
> invocations per second):
>   - 5.568M/s as baseline;
>   - 5.679M/s with changes in this patch (+2% throughput improvement);
>   - 5.808M/s with disabling rcu_is_watching in rethook_try_get()
> (+2.3% more vs just one of rcu_is_watching, and +4.3% vs baseline).
> 
> It's definitely noticeable.

Ah, thanks for verifying (I should have read the thread before replying to
the previous email).

-- Steve
Masami Hiramatsu (Google) April 6, 2024, 3:41 a.m. UTC | #18
On Tue, 2 Apr 2024 22:21:00 -0700
Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote:

> On Tue, Apr 2, 2024 at 9:00 PM Andrii Nakryiko
> <andrii.nakryiko@gmail.com> wrote:
> >
> > On Tue, Apr 2, 2024 at 5:52 PM Steven Rostedt <rostedt@goodmis.org> wrote:
> > >
> > > On Wed, 3 Apr 2024 09:40:48 +0900
> > > Masami Hiramatsu (Google) <mhiramat@kernel.org> wrote:
> > >
> > > > OK, for me, this last sentence is preferred for the help message. That explains
> > > > what this is for.
> > > >
> > > >         All callbacks that attach to the function tracing have some sort
> > > >         of protection against recursion. This option is only to verify that
> > > >        ftrace (and other users of ftrace_test_recursion_trylock()) are not
> > > >         called outside of RCU, as if they are, it can cause a race.
> > > >         But it also has a noticeable overhead when enabled.
> >
> > Sounds good to me, I can add this to the description of the Kconfig option.
> >
> > > >
> > > > BTW, how much overhead does this introduce? and the race case a kernel crash?
> >
> > I just checked our fleet-wide production data for the last 24 hours.
> > Within the kprobe/kretprobe code path (ftrace_trampoline and
> > everything called from it), rcu_is_watching (both calls, see below)
> > cause 0.484% CPU cycles usage, which isn't nothing. So definitely we'd
> > prefer to be able to avoid that in production use cases.
> >
> 
> I just ran synthetic microbenchmark testing multi-kretprobe
> throughput. We get (in millions of BPF kretprobe-multi program
> invocations per second):
>   - 5.568M/s as baseline;
>   - 5.679M/s with changes in this patch (+2% throughput improvement);
>   - 5.808M/s with disabling rcu_is_watching in rethook_try_get()
> (+2.3% more vs just one of rcu_is_watching, and +4.3% vs baseline).
> 
> It's definitely noticeable.

Thanks for checking the overhead! Hmm, it is considerable.

> > > > or just messed up the ftrace buffer?
> > >
> > > There's a hypothetical race where it can cause a use after free.

Hmm, so it might not lead a kernel crash but is better to enable with
other debugging options.

> > >
> > > That is, after you shutdown ftrace, you need to call synchronize_rcu_tasks(),
> > > which requires RCU to be watching. There's a theoretical case where that
> > > task calls the trampoline and misses the synchronization. Note, these
> > > locations are with preemption disabled, as rcu is always watching when
> > > preemption is enabled. Thus it would be extremely fast where as the
> > > synchronize_rcu_tasks() is rather slow.
> > >
> > > We also have synchronize_rcu_tasks_rude() which would actually keep the
> > > trace from happening, as it would schedule on each CPU forcing all CPUs to
> > > have RCU watching.
> > >
> > > I have never heard of this race being hit. I guess it could happen on a VM
> > > where a vCPU gets preempted at the right moment for a long time and the
> > > other CPUs synchronize.
> > >
> > > But like lockdep, where deadlocks can crash the kernel, we don't enable it
> > > for production.
> > >
> > > The overhead is another function call within the function tracer. I had
> > > numbers before, but I guess I could run tests again and get new numbers.
> > >
> >
> > I just noticed another rcu_is_watching() call, in rethook_try_get(),
> > which seems to be a similar and complementary validation check to the
> > one we are putting under CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING option
> > in this patch. It feels like both of them should be controlled by the
> > same settings. WDYT? Can I add CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING
> > guard around rcu_is_watching() check in rethook_try_get() as well?

Hmmm, no, I think it should not change the rethook side because rethook
can be used with kprobes without ftrace. If we can detect it is used from
the ftrace, we can skip it. (From this reason, I would like to remove
return probe from kprobes...)

Thank you,

> >
> >
> > > Thanks,
> > >
> > > -- Steve
Andrii Nakryiko April 6, 2024, 4:06 p.m. UTC | #19
On Fri, Apr 5, 2024 at 8:41 PM Masami Hiramatsu <mhiramat@kernel.org> wrote:
>
> On Tue, 2 Apr 2024 22:21:00 -0700
> Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote:
>
> > On Tue, Apr 2, 2024 at 9:00 PM Andrii Nakryiko
> > <andrii.nakryiko@gmail.com> wrote:
> > >
> > > On Tue, Apr 2, 2024 at 5:52 PM Steven Rostedt <rostedt@goodmis.org> wrote:
> > > >
> > > > On Wed, 3 Apr 2024 09:40:48 +0900
> > > > Masami Hiramatsu (Google) <mhiramat@kernel.org> wrote:
> > > >
> > > > > OK, for me, this last sentence is preferred for the help message. That explains
> > > > > what this is for.
> > > > >
> > > > >         All callbacks that attach to the function tracing have some sort
> > > > >         of protection against recursion. This option is only to verify that
> > > > >        ftrace (and other users of ftrace_test_recursion_trylock()) are not
> > > > >         called outside of RCU, as if they are, it can cause a race.
> > > > >         But it also has a noticeable overhead when enabled.
> > >
> > > Sounds good to me, I can add this to the description of the Kconfig option.
> > >
> > > > >
> > > > > BTW, how much overhead does this introduce? and the race case a kernel crash?
> > >
> > > I just checked our fleet-wide production data for the last 24 hours.
> > > Within the kprobe/kretprobe code path (ftrace_trampoline and
> > > everything called from it), rcu_is_watching (both calls, see below)
> > > cause 0.484% CPU cycles usage, which isn't nothing. So definitely we'd
> > > prefer to be able to avoid that in production use cases.
> > >
> >
> > I just ran synthetic microbenchmark testing multi-kretprobe
> > throughput. We get (in millions of BPF kretprobe-multi program
> > invocations per second):
> >   - 5.568M/s as baseline;
> >   - 5.679M/s with changes in this patch (+2% throughput improvement);
> >   - 5.808M/s with disabling rcu_is_watching in rethook_try_get()
> > (+2.3% more vs just one of rcu_is_watching, and +4.3% vs baseline).
> >
> > It's definitely noticeable.
>
> Thanks for checking the overhead! Hmm, it is considerable.
>
> > > > > or just messed up the ftrace buffer?
> > > >
> > > > There's a hypothetical race where it can cause a use after free.
>
> Hmm, so it might not lead a kernel crash but is better to enable with
> other debugging options.
>
> > > >
> > > > That is, after you shutdown ftrace, you need to call synchronize_rcu_tasks(),
> > > > which requires RCU to be watching. There's a theoretical case where that
> > > > task calls the trampoline and misses the synchronization. Note, these
> > > > locations are with preemption disabled, as rcu is always watching when
> > > > preemption is enabled. Thus it would be extremely fast where as the
> > > > synchronize_rcu_tasks() is rather slow.
> > > >
> > > > We also have synchronize_rcu_tasks_rude() which would actually keep the
> > > > trace from happening, as it would schedule on each CPU forcing all CPUs to
> > > > have RCU watching.
> > > >
> > > > I have never heard of this race being hit. I guess it could happen on a VM
> > > > where a vCPU gets preempted at the right moment for a long time and the
> > > > other CPUs synchronize.
> > > >
> > > > But like lockdep, where deadlocks can crash the kernel, we don't enable it
> > > > for production.
> > > >
> > > > The overhead is another function call within the function tracer. I had
> > > > numbers before, but I guess I could run tests again and get new numbers.
> > > >
> > >
> > > I just noticed another rcu_is_watching() call, in rethook_try_get(),
> > > which seems to be a similar and complementary validation check to the
> > > one we are putting under CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING option
> > > in this patch. It feels like both of them should be controlled by the
> > > same settings. WDYT? Can I add CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING
> > > guard around rcu_is_watching() check in rethook_try_get() as well?
>
> Hmmm, no, I think it should not change the rethook side because rethook
> can be used with kprobes without ftrace. If we can detect it is used from

It's a good thing that I split that into a separate patch, then.
Hopefully the first patch looks good and you can apply it as is.

> the ftrace, we can skip it. (From this reason, I would like to remove
> return probe from kprobes...)

I'm on PTO for the next two weeks and I can take a look at more
properly guarding rcu_is_watching() in rethook_try_get() when I'm
back. Thanks.

>
> Thank you,
>
> > >
> > >
> > > > Thanks,
> > > >
> > > > -- Steve
>
>
> --
> Masami Hiramatsu (Google) <mhiramat@kernel.org>
diff mbox series

Patch

diff --git a/include/linux/trace_recursion.h b/include/linux/trace_recursion.h
index d48cd92d2364..24ea8ac049b4 100644
--- a/include/linux/trace_recursion.h
+++ b/include/linux/trace_recursion.h
@@ -135,7 +135,7 @@  extern void ftrace_record_recursion(unsigned long ip, unsigned long parent_ip);
 # define do_ftrace_record_recursion(ip, pip)	do { } while (0)
 #endif
 
-#ifdef CONFIG_ARCH_WANTS_NO_INSTR
+#ifdef CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING
 # define trace_warn_on_no_rcu(ip)					\
 	({								\
 		bool __ret = !rcu_is_watching();			\
diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
index 61c541c36596..19bce4e217d6 100644
--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -974,6 +974,19 @@  config FTRACE_RECORD_RECURSION_SIZE
 	  This file can be reset, but the limit can not change in
 	  size at runtime.
 
+config FTRACE_VALIDATE_RCU_IS_WATCHING
+	bool "Validate RCU is on during ftrace recursion check"
+	depends on FUNCTION_TRACER
+	depends on ARCH_WANTS_NO_INSTR
+	help
+	  All callbacks that attach to the function tracing have some sort
+	  of protection against recursion. This option performs additional
+	  checks to make sure RCU is on when ftrace callbacks recurse.
+
+	  This will add more overhead to all ftrace-based invocations.
+
+	  If unsure, say N
+
 config RING_BUFFER_RECORD_RECURSION
 	bool "Record functions that recurse in the ring buffer"
 	depends on FTRACE_RECORD_RECURSION