diff mbox series

[v3,28/46] kmsan: entry: handle register passing from uninstrumented code

Message ID 20220426164315.625149-29-glider@google.com (mailing list archive)
State New
Headers show
Series Add KernelMemorySanitizer infrastructure | expand

Commit Message

Alexander Potapenko April 26, 2022, 4:42 p.m. UTC
Replace instrumentation_begin()	with instrumentation_begin_with_regs()
to let KMSAN handle the non-instrumented code and unpoison pt_regs
passed from the instrumented part.

Signed-off-by: Alexander Potapenko <glider@google.com>
---
Link: https://linux-review.googlesource.com/id/I7f0a9809b66bd85faae43142971d0095771b7a42
---
 kernel/entry/common.c | 22 +++++++++++-----------
 1 file changed, 11 insertions(+), 11 deletions(-)

Comments

Thomas Gleixner April 27, 2022, 1:32 p.m. UTC | #1
On Tue, Apr 26 2022 at 18:42, Alexander Potapenko wrote:

Can you please use 'entry:' as prefix. Slapping kmsan in front of
everything does not really make sense.

> Replace instrumentation_begin()	with instrumentation_begin_with_regs()
> to let KMSAN handle the non-instrumented code and unpoison pt_regs
> passed from the instrumented part.

That should be:

     from the non-instrumented part
or
     passed to the instrumented part

right?

> --- a/kernel/entry/common.c
> +++ b/kernel/entry/common.c
> @@ -23,7 +23,7 @@ static __always_inline void __enter_from_user_mode(struct pt_regs *regs)
>  	CT_WARN_ON(ct_state() != CONTEXT_USER);
>  	user_exit_irqoff();
>  
> -	instrumentation_begin();
> +	instrumentation_begin_with_regs(regs);

I can see what you are trying to do, but this will end up doing the same
thing over and over. Let's just look at a syscall.

__visible noinstr void do_syscall_64(struct pt_regs *regs, int nr)
{
        ...
	nr = syscall_enter_from_user_mode(regs, nr)

  		__enter_from_user_mode(regs)
              		.....
			instrumentation_begin_with_regs(regs);
			....

                instrumentation_begin_with_regs(regs);
                ....                     

	instrumentation_begin_with_regs(regs);

	if (!do_syscall_x64(regs, nr) && !do_syscall_x32(regs, nr) && nr != -1) {
		/* Invalid system call, but still a system call. */
		regs->ax = __x64_sys_ni_syscall(regs);
	}

	instrumentation_end();

        syscall_exit_to_user_mode(regs);
		instrumentation_begin_with_regs(regs);
  		__syscall_exit_to_user_mode_work(regs);
  	instrumentation_end();
  	__exit_to_user_mode();

That means you memset state four times and unpoison regs four times. I'm
not sure whether that's desired.

instrumentation_begin()/end() are not really suitable IMO. They were
added to allow objtool to validate that nothing escapes into
instrumentable code unless annotated accordingly.

Thanks,

        tglx
Alexander Potapenko May 2, 2022, 5 p.m. UTC | #2
On Wed, Apr 27, 2022 at 3:32 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> On Tue, Apr 26 2022 at 18:42, Alexander Potapenko wrote:
>
> Can you please use 'entry:' as prefix. Slapping kmsan in front of
> everything does not really make sense.
Sure, will do.

> > Replace instrumentation_begin()       with instrumentation_begin_with_regs()
> > to let KMSAN handle the non-instrumented code and unpoison pt_regs
> > passed from the instrumented part.
>
> That should be:
>
>      from the non-instrumented part
> or
>      passed to the instrumented part
>
> right?

That should be "from the non-instrumented part", you are right.

> > --- a/kernel/entry/common.c
> > +++ b/kernel/entry/common.c
> > @@ -23,7 +23,7 @@ static __always_inline void __enter_from_user_mode(struct pt_regs *regs)
> >       CT_WARN_ON(ct_state() != CONTEXT_USER);
> >       user_exit_irqoff();
> >
> > -     instrumentation_begin();
> > +     instrumentation_begin_with_regs(regs);
>
> I can see what you are trying to do, but this will end up doing the same
> thing over and over. Let's just look at a syscall.
>
> __visible noinstr void do_syscall_64(struct pt_regs *regs, int nr)
> {
>         ...
>         nr = syscall_enter_from_user_mode(regs, nr)
>
>                 __enter_from_user_mode(regs)
>                         .....
>                         instrumentation_begin_with_regs(regs);
>                         ....
>
>                 instrumentation_begin_with_regs(regs);
>                 ....
>
>         instrumentation_begin_with_regs(regs);
>
>         if (!do_syscall_x64(regs, nr) && !do_syscall_x32(regs, nr) && nr != -1) {
>                 /* Invalid system call, but still a system call. */
>                 regs->ax = __x64_sys_ni_syscall(regs);
>         }
>
>         instrumentation_end();
>
>         syscall_exit_to_user_mode(regs);
>                 instrumentation_begin_with_regs(regs);
>                 __syscall_exit_to_user_mode_work(regs);
>         instrumentation_end();
>         __exit_to_user_mode();
>
> That means you memset state four times and unpoison regs four times. I'm
> not sure whether that's desired.

Regarding the regs, you are right. It should be enough to unpoison the
regs at idtentry prologue instead.
I tried that initially, but IIRC it required patching each of the
DEFINE_IDTENTRY_XXX macros, which already use instrumentation_begin().
This decision can probably be revisited.

As for the state, what we are doing here is still not enough, although
it appears to work.

Every time an instrumented function calls another function, it sets up
the metadata for the function arguments in the per-task struct
kmsan_context_state.
Similarly, every instrumented function expects its caller to put the
metadata into that structure.
Now, if a non-instrumented function (e.g. every `noinstr` function)
calls an instrumented one (which happens inside the
instrumentation_begin()/instrumentation_end() region), nobody sets up
the state for that instrumented function, so it may report false
positives when accessing its arguments, if there are leftover poisoned
values in the state.

To overcome this problem, ideally we need to wipe kmsan_context_state
every time a call from the non-instrumented function occurs.
But this cannot be done automatically exactly because we cannot
instrument the named function :)

We therefore apply an approximation, wiping the state at the point of
the first transition between instrumented and non-instrumented code.
Because poison values are generally rare, and instrumented regions
tend to be short, it is unlikely that further calls from the same
non-instrumented function will result in false positives.
Yet it is not completely impossible, so wiping the state for the
second/third etc. time won't hurt.
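
For illustration, the calling convention roughly looks like the sketch
below (hypothetical: the field names only approximate the real
kmsan_context_state, shadow_of() is a made-up placeholder for the
actual shadow lookup, and the state is per-task in the real runtime):

==============================================
#include <string.h>

struct kmsan_context_state {
	char param_tls[800];	/* shadow of the callee's arguments */
	char retval_tls[800];	/* shadow of the return value */
};

static struct kmsan_context_state ctx;

static void *shadow_of(void *p) { return p; }	/* placeholder */

static long callee(long a)
{
	/* An instrumented callee loads its argument shadow from ctx.param_tls. */
	return a;
}

/* What the compiler conceptually emits in an instrumented caller: */
static long instrumented_caller(long a)
{
	/* Publish the shadow of 'a' so callee() can pick it up. */
	memcpy(ctx.param_tls, shadow_of(&a), sizeof(a));
	return callee(a);
}

/*
 * A noinstr caller emits none of the above, so ctx.param_tls keeps
 * whatever a previous call left there -- the stale shadow behind the
 * false positives described above.
 */
==============================================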

>
> instrumentation_begin()/end() are not really suitable IMO. They were
> added to allow objtool to validate that nothing escapes into
> instrumentable code unless annotated accordingly.

An alternative would be to add extra code that unpoisons the state to
every non-instrumented function that contains an instrumented region.
That code would have to precede the first instrumentation_begin()
anyway, so I thought it would be reasonable to piggyback on the
existing annotation.

>
> Thanks,
>
>         tglx
Thomas Gleixner May 2, 2022, 10 p.m. UTC | #3
Alexander,

On Mon, May 02 2022 at 19:00, Alexander Potapenko wrote:
> On Wed, Apr 27, 2022 at 3:32 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>> > --- a/kernel/entry/common.c
>> > +++ b/kernel/entry/common.c
>> > @@ -23,7 +23,7 @@ static __always_inline void __enter_from_user_mode(struct pt_regs *regs)
>> >       CT_WARN_ON(ct_state() != CONTEXT_USER);
>> >       user_exit_irqoff();
>> >
>> > -     instrumentation_begin();
>> > +     instrumentation_begin_with_regs(regs);
>>
>> I can see what you are trying to do, but this will end up doing the same
>> thing over and over. Let's just look at a syscall.
>>
>> __visible noinstr void do_syscall_64(struct pt_regs *regs, int nr)
>> {
>>         ...
>>         nr = syscall_enter_from_user_mode(regs, nr)
>>
>>                 __enter_from_user_mode(regs)
>>                         .....
>>                         instrumentation_begin_with_regs(regs);
>>                         ....
>>
>>                 instrumentation_begin_with_regs(regs);
>>                 ....
>>
>>         instrumentation_begin_with_regs(regs);
>>
>>         if (!do_syscall_x64(regs, nr) && !do_syscall_x32(regs, nr) && nr != -1) {
>>                 /* Invalid system call, but still a system call. */
>>                 regs->ax = __x64_sys_ni_syscall(regs);
>>         }
>>
>>         instrumentation_end();
>>
>>         syscall_exit_to_user_mode(regs);
>>                 instrumentation_begin_with_regs(regs);
>>                 __syscall_exit_to_user_mode_work(regs);
>>         instrumentation_end();
>>         __exit_to_user_mode();
>>
>> That means you memset state four times and unpoison regs four times. I'm
>> not sure whether that's desired.
>
> Regarding the regs, you are right. It should be enough to unpoison the
> regs at idtentry prologue instead.
> I tried that initially, but IIRC it required patching each of the
> DEFINE_IDTENTRY_XXX macros, which already use instrumentation_begin().

Exactly 4 instances :)

> This decision can probably be revisited.

It has to be revisited because the whole thing is incomplete if this is
not addressed.

> As for the state, what we are doing here is still not enough, although
> it appears to work.
>
> Every time an instrumented function calls another function, it sets up
> the metadata for the function arguments in the per-task struct
> kmsan_context_state.
> Similarly, every instrumented function expects its caller to put the
> metadata into that structure.
> Now, if a non-instrumented function (e.g. every `noinstr` function)
> calls an instrumented one (which happens inside the
> instrumentation_begin()/instrumentation_end() region), nobody sets up
> the state for that instrumented function, so it may report false
> positives when accessing its arguments, if there are leftover poisoned
> values in the state.
>
> To overcome this problem, ideally we need to wipe kmsan_context_state
> every time a call from the non-instrumented function occurs.
> But this cannot be done automatically exactly because we cannot
> instrument the named function :)
>
> We therefore apply an approximation, wiping the state at the point of
> the first transition between instrumented and non-instrumented code.
> Because poison values are generally rare, and instrumented regions
> tend to be short, it is unlikely that further calls from the same
> non-instrumented function will result in false positives.
> Yet it is not completely impossible, so wiping the state for the
> second/third etc. time won't hurt.

Understood. But if I understand you correctly:

> Similarly, every instrumented function expects its caller to put the
> metadata into that structure.

then

     instrumentation_begin();
     foo(fargs...);
     bar(bargs...);
     instrumentation_end();

is a source of potential false positives because the state is not
guaranteed to be correct, neither for foo() nor for bar(), even if you
wipe the state in instrumentation_begin(), right?

This approximation approach smells fishy and it's inevitably going to be
a constant source of 'add yet another kmsan annotation/fixup' patches,
which I'm not interested in at all.

As this needs compiler support anyway, then why not doing the obvious:

#define noinstr                                 \
        .... __kmsan_conditional

#define instrumentation_begin()                 \
        ..... __kmsan_cond_begin

#define instrumentation_end()                   \
        __kmsan_cond_end .......

and let the compiler stick whatever is required into that code section
between instrumentation_begin() and instrumentation_end()?

That's not violating any of the noinstr constraints at all. In fact we
allow _any_ instrumentation to be placed between these two points. We
have tracepoints there today.

We could also allow breakpoints, kprobes or whatever, but handling this
at that granularity level for a production kernel is just overkill and
the code in those instrumentable sections is usually not that
interesting as it's mostly function calls.

But if the compiler converts

     instrumentation_begin();
     foo(fargs...);
     bar(bargs...);
     instrumentation_end();

to

     instrumentation_begin();
     kmsan_instr_begin_magic();
     kmsan_magic(fargs...);
     foo(fargs...);
     kmsan_magic(bargs...);
     bar(bargs...);
     kmsan_instr_end_magic();
     instrumentation_end();

for the kmsan case and leaves anything outside of these sections alone,
then you have:

   - a minimal code change
   - the best possible coverage
   - the least false positive crap to chase and annotate

IOW, a solution which is solid and future proof.

I'm all for making use of advanced instrumentation, validation and
debugging features, but this mindset of 'make the code comply to what
the tool of today provides' is fundamentally wrong. Tools have to
provide value to the programmer and not the other way round.

Yes, it's more work on the tooling side, but the tooling side is mostly
a one time effort while chasing the false positives is a long term
nightmare.

Thanks,

        tglx
Alexander Potapenko May 5, 2022, 6:04 p.m. UTC | #4
On Tue, May 3, 2022 at 12:00 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> Alexander,

First of all, thanks a lot for the comments, those are greatly appreciated!
I tried to revert this patch and the previous one ("kmsan:
instrumentation.h: add instrumentation_begin_with_regs()") and
reimplement unpoisoning pt_regs without breaking into
instrumentation_begin(), see below.

> >
> > Regarding the regs, you are right. It should be enough to unpoison the
> > regs at idtentry prologue instead.
> > I tried that initially, but IIRC it required patching each of the
> > DEFINE_IDTENTRY_XXX macros, which already use instrumentation_begin().
>
> Exactly 4 instances :)
>

Not really, I had to add a call to `kmsan_unpoison_memory(regs,
sizeof(*regs));` to the following places in
arch/x86/include/asm/idtentry.h:
- DEFINE_IDTENTRY()
- DEFINE_IDTENTRY_ERRORCODE()
- DEFINE_IDTENTRY_RAW()
- DEFINE_IDTENTRY_RAW_ERRORCODE()
- DEFINE_IDTENTRY_IRQ()
- DEFINE_IDTENTRY_SYSVEC()
- DEFINE_IDTENTRY_SYSVEC_SIMPLE()
- DEFINE_IDTENTRY_DF()

, but even that wasn't enough. For some reason I also had to unpoison
pt_regs directly in
DEFINE_IDTENTRY_SYSVEC(sysvec_apic_timer_interrupt) and
DEFINE_IDTENTRY_IRQ(common_interrupt).
In the latter case, this could have been caused by
asm_common_interrupt being entered from irq_entries_start(), but I am
not sure what is so special about sysvec_apic_timer_interrupt().

Ideally, it would be great to find that single point where pt_regs are
set up before being passed to all IDT entries.
I used to do that by inserting calls to kmsan_unpoison_memory right
into arch/x86/entry/entry_64.S
(https://github.com/google/kmsan/commit/3b0583f45f74f3a09f4c7e0e0588169cef918026),
but that required storing/restoring all GP registers. Maybe there's a
better way?


>
> then
>
>      instrumentation_begin();
>      foo(fargs...);
>      bar(bargs...);
>      instrumentation_end();
>
> is a source of potential false positives because the state is not
> guaranteed to be correct, neither for foo() nor for bar(), even if you
> wipe the state in instrumentation_begin(), right?

Yes, this is right.

> This approximation approach smells fishy and it's inevitably going to be
> a constant source of 'add yet another kmsan annotation/fixup' patches,
> which I'm not interested in at all.
>
> As this needs compiler support anyway, then why not doing the obvious:
>
> #define noinstr                                 \
>         .... __kmsan_conditional
>
> #define instrumentation_begin()                 \
>         ..... __kmsan_cond_begin
>
> #define instrumentation_end()                   \
>         __kmsan_cond_end .......
>
> and let the compiler stick whatever is required into that code section
> between instrumentation_begin() and instrumentation_end()?

We define noinstr as
__attribute__((disable_sanitizer_instrumentation))
(https://llvm.org/docs/LangRef.html#:~:text=disable_sanitizer_instrumentation),
which means no instrumentation will be applied to the annotated
function.
Changing that behavior by adding subregions that can be instrumented
sounds questionable.
C also doesn't have good syntactic means to define these subregions -
perhaps some __xxx_begin()/__xxx_end() intrinsics would work, but they
would require both compile-time and run-time validation.

Fortunately, I don't think we need to insert extra instrumentation
into instrumentation_begin()/instrumentation_end() regions.

What I have in mind is adding a bool flag to kmsan_context_state, that
the instrumentation sets to true before the function call.
When entering an instrumented function, KMSAN would check that flag
and set it to false, so that the context state can be only used once.
If a function is called from another instrumented function, the
context state is properly set up, and there is nothing to worry about.
If it is called from non-instrumented code (either noinstr or the
skipped files that have KMSAN_SANITIZE:=n), KMSAN would detect that
and wipe the context state before use.
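
A minimal sketch of that idea (hypothetical helper names, not the
actual implementation):

==============================================
#include <stdbool.h>
#include <string.h>

struct kmsan_context_state {
	/* ... param_tls/retval_tls/origin arrays as before ... */
	bool valid;	/* set by instrumented callers, cleared by callees */
};

/* Emitted by the compiler in an instrumented caller, before each call: */
static inline void kmsan_arm_context_state(struct kmsan_context_state *cs)
{
	cs->valid = true;
}

/* Emitted at the entry of every instrumented function: */
static inline void kmsan_consume_context_state(struct kmsan_context_state *cs)
{
	if (!cs->valid)
		memset(cs, 0, sizeof(*cs));	/* caller was not instrumented */
	cs->valid = false;			/* the state may be used only once */
}
==============================================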

By the way, I've noticed that at least for now (with pt_regs
unpoisoning performed in IDT entries) the problem with false positives
in noinstr code is entirely gone, so maybe we don't even have to
bother.

> Yes, it's more work on the tooling side, but the tooling side is mostly
> a one time effort while chasing the false positives is a long term
> nightmare.

Well said.

> Thanks,
>
>         tglx
Thomas Gleixner May 5, 2022, 9:56 p.m. UTC | #5
Alexander,

On Thu, May 05 2022 at 20:04, Alexander Potapenko wrote:
> On Tue, May 3, 2022 at 12:00 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>> > Regarding the regs, you are right. It should be enough to unpoison the
>> > regs at idtentry prologue instead.
>> > I tried that initially, but IIRC it required patching each of the
>> > DEFINE_IDTENTRY_XXX macros, which already use instrumentation_begin().
>>
>> Exactly 4 instances :)
>>
>
> Not really, I had to add a call to `kmsan_unpoison_memory(regs,
> sizeof(*regs));` to the following places in
> arch/x86/include/asm/idtentry.h:
> - DEFINE_IDTENTRY()
> - DEFINE_IDTENTRY_ERRORCODE()
> - DEFINE_IDTENTRY_RAW()
> - DEFINE_IDTENTRY_RAW_ERRORCODE()
> - DEFINE_IDTENTRY_IRQ()
> - DEFINE_IDTENTRY_SYSVEC()
> - DEFINE_IDTENTRY_SYSVEC_SIMPLE()
> - DEFINE_IDTENTRY_DF()
>
> , but even that wasn't enough. For some reason I also had to unpoison
> pt_regs directly in
> DEFINE_IDTENTRY_SYSVEC(sysvec_apic_timer_interrupt) and
> DEFINE_IDTENTRY_IRQ(common_interrupt).
> In the latter case, this could have been caused by
> asm_common_interrupt being entered from irq_entries_start(), but I am
> not sure what is so special about sysvec_apic_timer_interrupt().
>
> Ideally, it would be great to find that single point where pt_regs are
> set up before being passed to all IDT entries.
> I used to do that by inserting calls to kmsan_unpoison_memory right
> into arch/x86/entry/entry_64.S
> (https://github.com/google/kmsan/commit/3b0583f45f74f3a09f4c7e0e0588169cef918026),
> but that required storing/restoring all GP registers. Maybe there's a
> better way?

Yes. Something like this should cover all exceptions and syscalls before
anything instrumentable can touch @regs. Anything up to those points is
either off-limit for instrumentation or does not deal with @regs.

--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -24,6 +24,7 @@ static __always_inline void __enter_from
 	user_exit_irqoff();
 
 	instrumentation_begin();
+	unpoison(regs);
 	trace_hardirqs_off_finish();
 	instrumentation_end();
 }
@@ -352,6 +353,7 @@ noinstr irqentry_state_t irqentry_enter(
 		lockdep_hardirqs_off(CALLER_ADDR0);
 		rcu_irq_enter();
 		instrumentation_begin();
+		unpoison(regs);
 		trace_hardirqs_off_finish();
 		instrumentation_end();
 
@@ -367,6 +369,7 @@ noinstr irqentry_state_t irqentry_enter(
 	 */
 	lockdep_hardirqs_off(CALLER_ADDR0);
 	instrumentation_begin();
+	unpoison(regs);
 	rcu_irq_enter_check_tick();
 	trace_hardirqs_off_finish();
 	instrumentation_end();
@@ -452,6 +455,7 @@ irqentry_state_t noinstr irqentry_nmi_en
 	rcu_nmi_enter();
 
 	instrumentation_begin();
+	unpoison(regs);
 	trace_hardirqs_off_finish();
 	ftrace_nmi_enter();
 	instrumentation_end();

As I said: 4 places :)

> Fortunately, I don't think we need to insert extra instrumentation
> into instrumentation_begin()/instrumentation_end() regions.
>
> What I have in mind is adding a bool flag to kmsan_context_state, that
> the instrumentation sets to true before the function call.
> When entering an instrumented function, KMSAN would check that flag
> and set it to false, so that the context state can be only used once.
> If a function is called from another instrumented function, the
> context state is properly set up, and there is nothing to worry about.
> If it is called from non-instrumented code (either noinstr or the
> skipped files that have KMSAN_SANITIZE:=n), KMSAN would detect that
> and wipe the context state before use.

Sounds good.

Thanks,

	tglx
Alexander Potapenko May 6, 2022, 2:52 p.m. UTC | #6
On Thu, May 5, 2022 at 11:56 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> Alexander,
>
> On Thu, May 05 2022 at 20:04, Alexander Potapenko wrote:
> > On Tue, May 3, 2022 at 12:00 AM Thomas Gleixner <tglx@linutronix.de> wrote:
> >> > Regarding the regs, you are right. It should be enough to unpoison the
> >> > regs at idtentry prologue instead.
> >> > I tried that initially, but IIRC it required patching each of the
> >> > DEFINE_IDTENTRY_XXX macros, which already use instrumentation_begin().
> >>
> >> Exactly 4 instances :)
> >>
> >
> > Not really, I had to add a call to `kmsan_unpoison_memory(regs,
> > sizeof(*regs));` to the following places in
> > arch/x86/include/asm/idtentry.h:
> > - DEFINE_IDTENTRY()
> > - DEFINE_IDTENTRY_ERRORCODE()
> > - DEFINE_IDTENTRY_RAW()
> > - DEFINE_IDTENTRY_RAW_ERRORCODE()
> > - DEFINE_IDTENTRY_IRQ()
> > - DEFINE_IDTENTRY_SYSVEC()
> > - DEFINE_IDTENTRY_SYSVEC_SIMPLE()
> > - DEFINE_IDTENTRY_DF()
> >
> > , but even that wasn't enough. For some reason I also had to unpoison
> > pt_regs directly in
> > DEFINE_IDTENTRY_SYSVEC(sysvec_apic_timer_interrupt) and
> > DEFINE_IDTENTRY_IRQ(common_interrupt).
> > In the latter case, this could have been caused by
> > asm_common_interrupt being entered from irq_entries_start(), but I am
> > not sure what is so special about sysvec_apic_timer_interrupt().
> >
> > Ideally, it would be great to find that single point where pt_regs are
> > set up before being passed to all IDT entries.
> > I used to do that by inserting calls to kmsan_unpoison_memory right
> > into arch/x86/entry/entry_64.S
> > (https://github.com/google/kmsan/commit/3b0583f45f74f3a09f4c7e0e0588169cef918026),
> > but that required storing/restoring all GP registers. Maybe there's a
> > better way?
>
> Yes. Something like this should cover all exceptions and syscalls before
> anything instrumentable can touch @regs. Anything up to those points is
> either off-limit for instrumentation or does not deal with @regs.
>
> --- a/kernel/entry/common.c
> +++ b/kernel/entry/common.c
> @@ -24,6 +24,7 @@ static __always_inline void __enter_from
>         user_exit_irqoff();
>
>         instrumentation_begin();
> +       unpoison(regs);
>         trace_hardirqs_off_finish();
>         instrumentation_end();
>  }
> @@ -352,6 +353,7 @@ noinstr irqentry_state_t irqentry_enter(
>                 lockdep_hardirqs_off(CALLER_ADDR0);
>                 rcu_irq_enter();
>                 instrumentation_begin();
> +               unpoison(regs);
>                 trace_hardirqs_off_finish();
>                 instrumentation_end();
>
> @@ -367,6 +369,7 @@ noinstr irqentry_state_t irqentry_enter(
>          */
>         lockdep_hardirqs_off(CALLER_ADDR0);
>         instrumentation_begin();
> +       unpoison(regs);
>         rcu_irq_enter_check_tick();
>         trace_hardirqs_off_finish();
>         instrumentation_end();
> @@ -452,6 +455,7 @@ irqentry_state_t noinstr irqentry_nmi_en
>         rcu_nmi_enter();
>
>         instrumentation_begin();
> +       unpoison(regs);
>         trace_hardirqs_off_finish();
>         ftrace_nmi_enter();
>         instrumentation_end();
>
> As I said: 4 places :)

These four instances still do not look sufficient.
Right now I am seeing e.g. reports with the following stack trace:

BUG: KMSAN: uninit-value in irqtime_account_process_tick+0x255/0x580
kernel/sched/cputime.c:382
 irqtime_account_process_tick+0x255/0x580 kernel/sched/cputime.c:382
 account_process_tick+0x98/0x450 kernel/sched/cputime.c:476
 update_process_times+0xe4/0x3e0 kernel/time/timer.c:1786
 tick_sched_handle kernel/time/tick-sched.c:243
 tick_sched_timer+0x83e/0x9e0 kernel/time/tick-sched.c:1473
 __run_hrtimer+0x518/0xe50 kernel/time/hrtimer.c:1685
 __hrtimer_run_queues kernel/time/hrtimer.c:1749
 hrtimer_interrupt+0x838/0x15a0 kernel/time/hrtimer.c:1811
 local_apic_timer_interrupt arch/x86/kernel/apic/apic.c:1086
 __sysvec_apic_timer_interrupt+0x1ae/0x680 arch/x86/kernel/apic/apic.c:1103
 sysvec_apic_timer_interrupt+0x95/0xc0 arch/x86/kernel/apic/apic.c:1097
...
(uninit creation stack trace is irrelevant here, because it is some
random value from the stack)

sysvec_apic_timer_interrupt() receives struct pt_regs from
uninstrumented code, so regs can be partially uninitialized.
They are not passed down the call stack directly, but are instead
saved by set_irq_regs() in sysvec_apic_timer_interrupt() and loaded by
get_irq_regs() in tick_sched_timer().

The remaining false positives can be fixed by unpoisoning the
registers in set_irq_regs():

 static inline struct pt_regs *set_irq_regs(struct pt_regs *new_regs)
 {
        struct pt_regs *old_regs;
+       kmsan_unpoison_memory(new_regs, sizeof(*new_regs));

        old_regs = __this_cpu_read(__irq_regs);
        __this_cpu_write(__irq_regs, new_regs);

Does that sound viable? Is it correct to assume that set_irq_regs() is
always called for registers received from non-instrumented code?

(It seems that just unpoisoning registers in set_irq_regs() is not
enough, i.e. we still need to do what you suggest in
kernel/entry/common.c)
--
Alexander Potapenko
Thomas Gleixner May 6, 2022, 4:14 p.m. UTC | #7
On Fri, May 06 2022 at 16:52, Alexander Potapenko wrote:
> On Thu, May 5, 2022 at 11:56 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>> @@ -452,6 +455,7 @@ irqentry_state_t noinstr irqentry_nmi_en
>>         rcu_nmi_enter();
>>
>>         instrumentation_begin();
>> +       unpoison(regs);
>>         trace_hardirqs_off_finish();
>>         ftrace_nmi_enter();
>>         instrumentation_end();
>>
>> As I said: 4 places :)
>
> These four instances still do not look sufficient.
> Right now I am seeing e.g. reports with the following stack trace:
>
> BUG: KMSAN: uninit-value in irqtime_account_process_tick+0x255/0x580
> kernel/sched/cputime.c:382
>  irqtime_account_process_tick+0x255/0x580 kernel/sched/cputime.c:382
>  account_process_tick+0x98/0x450 kernel/sched/cputime.c:476
>  update_process_times+0xe4/0x3e0 kernel/time/timer.c:1786
>  tick_sched_handle kernel/time/tick-sched.c:243
>  tick_sched_timer+0x83e/0x9e0 kernel/time/tick-sched.c:1473
>  __run_hrtimer+0x518/0xe50 kernel/time/hrtimer.c:1685
>  __hrtimer_run_queues kernel/time/hrtimer.c:1749
>  hrtimer_interrupt+0x838/0x15a0 kernel/time/hrtimer.c:1811
>  local_apic_timer_interrupt arch/x86/kernel/apic/apic.c:1086
>  __sysvec_apic_timer_interrupt+0x1ae/0x680 arch/x86/kernel/apic/apic.c:1103
>  sysvec_apic_timer_interrupt+0x95/0xc0 arch/x86/kernel/apic/apic.c:1097
> ...
> (uninit creation stack trace is irrelevant here, because it is some
> random value from the stack)
>
> sysvec_apic_timer_interrupt() receives struct pt_regs from
> uninstrumented code, so regs can be partially uninitialized.
> They are not passed down the call stack directly, but are instead
> saved by set_irq_regs() in sysvec_apic_timer_interrupt() and loaded by
> get_irq_regs() in tick_sched_timer().

sysvec_apic_timer_interrupt() invokes irqentry_enter() _before_
set_irq_regs() and irqentry_enter() unpoisons @regs.

Confused...
Alexander Potapenko May 6, 2022, 5:41 p.m. UTC | #8
On Fri, May 6, 2022 at 6:14 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> On Fri, May 06 2022 at 16:52, Alexander Potapenko wrote:
> > On Thu, May 5, 2022 at 11:56 PM Thomas Gleixner <tglx@linutronix.de> wrote:
> >> @@ -452,6 +455,7 @@ irqentry_state_t noinstr irqentry_nmi_en
> >>         rcu_nmi_enter();
> >>
> >>         instrumentation_begin();
> >> +       unpoison(regs);
> >>         trace_hardirqs_off_finish();
> >>         ftrace_nmi_enter();
> >>         instrumentation_end();
> >>
> >> As I said: 4 places :)
> >
> > These four instances still do not look sufficient.
> > Right now I am seeing e.g. reports with the following stack trace:
> >
> > BUG: KMSAN: uninit-value in irqtime_account_process_tick+0x255/0x580
> > kernel/sched/cputime.c:382
> >  irqtime_account_process_tick+0x255/0x580 kernel/sched/cputime.c:382
> >  account_process_tick+0x98/0x450 kernel/sched/cputime.c:476
> >  update_process_times+0xe4/0x3e0 kernel/time/timer.c:1786
> >  tick_sched_handle kernel/time/tick-sched.c:243
> >  tick_sched_timer+0x83e/0x9e0 kernel/time/tick-sched.c:1473
> >  __run_hrtimer+0x518/0xe50 kernel/time/hrtimer.c:1685
> >  __hrtimer_run_queues kernel/time/hrtimer.c:1749
> >  hrtimer_interrupt+0x838/0x15a0 kernel/time/hrtimer.c:1811
> >  local_apic_timer_interrupt arch/x86/kernel/apic/apic.c:1086
> >  __sysvec_apic_timer_interrupt+0x1ae/0x680 arch/x86/kernel/apic/apic.c:1103
> >  sysvec_apic_timer_interrupt+0x95/0xc0 arch/x86/kernel/apic/apic.c:1097
> > ...
> > (uninit creation stack trace is irrelevant here, because it is some
> > random value from the stack)
> >
> > sysvec_apic_timer_interrupt() receives struct pt_regs from
> > uninstrumented code, so regs can be partially uninitialized.
> > They are not passed down the call stack directly, but are instead
> > saved by set_irq_regs() in sysvec_apic_timer_interrupt() and loaded by
> > get_irq_regs() in tick_sched_timer().
>
> sysvec_apic_timer_interrupt() invokes irqentry_enter() _before_
> set_irq_regs() and irqentry_enter() unpoisons @reg.
>
> Confused...

As far as I can tell in this case sysvec_apic_timer_interrupt() is
called by the following code in arch/x86/kernel/idt.c:

  INTG(LOCAL_TIMER_VECTOR,                asm_sysvec_apic_timer_interrupt),

, which does not use IDTENTRY_SYSVEC framework and thus does not call
irqentry_enter().

I guess handling those will require wrapping every interrupt gate into
a function that performs register unpoisoning?

By the way, if it helps, I think we don't necessarily have to call
kmsan_unpoison_memory() from within the
instrumentation_begin()/instrumentation_end() region?
We could move the call to the beginning of irqentry_enter(), removing
unnecessary duplication.
Thomas Gleixner May 6, 2022, 6:41 p.m. UTC | #9
On Fri, May 06 2022 at 19:41, Alexander Potapenko wrote:
> On Fri, May 6, 2022 at 6:14 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>> sysvec_apic_timer_interrupt() invokes irqentry_enter() _before_
>> set_irq_regs() and irqentry_enter() unpoisons @regs.
>>
>> Confused...
>
> As far as I can tell in this case sysvec_apic_timer_interrupt() is
> called by the following code in arch/x86/kernel/idt.c:
>
>   INTG(LOCAL_TIMER_VECTOR,                asm_sysvec_apic_timer_interrupt),
>
> , which does not use IDTENTRY_SYSVEC framework and thus does not call
> irqentry_enter().

  asm_sysvec_apic_timer_interrupt != sysvec_apic_timer_interrupt

arch/x86/kernel/apic/apic.c:
DEFINE_IDTENTRY_SYSVEC(sysvec_apic_timer_interrupt)
{
        ....

#define DEFINE_IDTENTRY_SYSVEC(func)					\
static void __##func(struct pt_regs *regs);				\
									\
__visible noinstr void func(struct pt_regs *regs)			\
{									\
	irqentry_state_t state = irqentry_enter(regs);			\
        ....
	__##func (regs);						\
        ....
}                                                                       \
		                                                        \
static noinline void __##func(struct pt_regs *regs)

So it goes through that code path _before_ the actual implementation
which does set_irq_regs() is reached.

The callchain is:

  asm_sysvec_apic_timer_interrupt               <- ASM entry in gate
     sysvec_apic_timer_interrupt(regs)          <- noinstr C entry point
        irqentry_enter(regs)                    <- unpoisons @reg
        __sysvec_apic_timer_interrupt(regs)     <- the actual handler
           set_irq_regs(regs)                   <- stores regs
           local_apic_timer_interrupt()
             ...
             tick_handler()                     <- One of the 4 variants
                regs = get_irq_regs();          <- retrieves regs
                update_process_times(user_tick = user_mode(regs))
                   account_process_tick(user_tick)
                      irqtime_account_process_tick(user_tick)
line 382:                } else if { user_tick }   <- KMSAN complains

I'm even more confused now.

> I guess handling those will require wrapping every interrupt gate into
> a function that performs register unpoisoning?

No, guessing does not help here.

The gates point to the ASM entry point, which then invokes the C entry
point. All C entry points use a DEFINE_IDTENTRY variant.

Some of the DEFINE_IDTENTRY_* C entry points are not doing anything in
the macro, but the C function either invokes irqentry_enter() or
irqentry_nmi_enter() open coded _before_ invoking any instrumentable
function. So the unpoisoning of @regs in these functions should tell
KMSAN that @regs or something derived from @regs are not some random
uninitialized values.

There should be no difference between unpoisoning @regs in
irqentry_enter() or in set_irq_regs(), right?

If so, then the problem is definitely _not_ the idt entry code.

> By the way, if it helps, I think we don't necessarily have to call
> kmsan_unpoison_memory() from within the
> instrumentation_begin()/instrumentation_end() region?
> We could move the call to the beginning of irqentry_enter(), removing
> unnecessary duplication.

We could, but then you need to mark unpoison_memory() noinstr too and you
have to add the unpoison into the syscall code. No win and irrelevant to
the problem at hand.

Thanks,

        tglx
Alexander Potapenko May 9, 2022, 4:50 p.m. UTC | #10
> The callchain is:
>
>   asm_sysvec_apic_timer_interrupt               <- ASM entry in gate
>      sysvec_apic_timer_interrupt(regs)          <- noinstr C entry point
>         irqentry_enter(regs)                    <- unpoisons @reg
>         __sysvec_apic_timer_interrupt(regs)     <- the actual handler
>            set_irq_regs(regs)                   <- stores regs
>            local_apic_timer_interrupt()
>              ...
>              tick_handler()                     <- One of the 4 variants
>                 regs = get_irq_regs();          <- retrieves regs
>                 update_process_times(user_tick = user_mode(regs))
>                    account_process_tick(user_tick)
>                       irqtime_account_process_tick(user_tick)
> line 382:                } else if { user_tick }   <- KMSAN complains
>
> I'm even more confused now.

Ok, I think I know what's going on.

Indeed, calling kmsan_unpoison_memory() in irqentry_enter() was
supposed to be enough, but we have code in kmsan_unpoison_memory() (as
well as other runtime functions) that checks for kmsan_in_runtime()
and bails out to prevent potential recursion if KMSAN code starts
calling itself.

kmsan_in_runtime() is implemented as follows:

==============================================
static __always_inline bool kmsan_in_runtime(void)
{
  if ((hardirq_count() >> HARDIRQ_SHIFT) > 1)
    return true;
  return kmsan_get_context()->kmsan_in_runtime;
}
==============================================
(see the code here:
https://lore.kernel.org/lkml/20220426164315.625149-13-glider@google.com/#Z31mm:kmsan:kmsan.h)

If we are running in the task context (in_task()==true),
kmsan_get_context() returns a per-task `struct *kmsan_ctx`.
If `in_task()==false` and `hardirq_count()>>HARDIRQ_SHIFT==1`, it
returns a per-CPU one.
Otherwise kmsan_in_runtime() is considered true to avoid dealing with
nested interrupts.

So in the case when `hardirq_count()>>HARDIRQ_SHIFT` is greater than
1, kmsan_in_runtime() becomes a no-op, which leads to false positives.

The solution I currently have in mind is to provide a specialized
version of kmsan_unpoison_memory() for entry.c, which will not perform
the reentrancy checks.

> > I guess handling those will require wrapping every interrupt gate into
> > a function that performs register unpoisoning?
>
> No, guessing does not help here.
>
> The gates point to the ASM entry point, which then invokes the C entry
> point. All C entry points use a DEFINE_IDTENTRY variant.

Thanks for the explanation. I previously thought there were two
different entry points, one with asm_ and one without, that ended up
calling the same code.

> Some of the DEFINE_IDTENTRY_* C entry points are not doing anything in
> the macro, but the C function either invokes irqentry_enter() or
> irqentry_nmi_enter() open coded _before_ invoking any instrumentable
> function. So the unpoisoning of @regs in these functions should tell
> KMSAN that @regs or something derived from @regs are not some random
> uninitialized values.
>
> There should be no difference between unpoisoning @regs in
> irqentry_enter() or in set_irq_regs(), right?
>
> If so, then the problem is definitely _not_ the idt entry code.
>
> > By the way, if it helps, I think we don't necessarily have to call
> > kmsan_unpoison_memory() from within the
> > instrumentation_begin()/instrumentation_end() region?
> > We could move the call to the beginning of irqentry_enter(), removing
> > unnecessary duplication.
>
> We could, but then you need to mark unpoison_memory() noinstr too and you
> have to add the unpoison into the syscall code. No win and irrelevant to
> the problem at hand.

Makes sense, thank you!

> Thanks,
>
>         tglx
>
>
Alexander Potapenko May 9, 2022, 4:51 p.m. UTC | #11
On Mon, May 9, 2022 at 6:50 PM Alexander Potapenko <glider@google.com> wrote:
>
> > The callchain is:
> >
> >   asm_sysvec_apic_timer_interrupt               <- ASM entry in gate
> >      sysvec_apic_timer_interrupt(regs)          <- noinstr C entry point
> >         irqentry_enter(regs)                    <- unpoisons @reg
> >         __sysvec_apic_timer_interrupt(regs)     <- the actual handler
> >            set_irq_regs(regs)                   <- stores regs
> >            local_apic_timer_interrupt()
> >              ...
> >              tick_handler()                     <- One of the 4 variants
> >                 regs = get_irq_regs();          <- retrieves regs
> >                 update_process_times(user_tick = user_mode(regs))
> >                    account_process_tick(user_tick)
> >                       irqtime_account_process_tick(user_tick)
> > line 382:                } else if { user_tick }   <- KMSAN complains
> >
> > I'm even more confused now.
>
> Ok, I think I know what's going on.
>
> Indeed, calling kmsan_unpoison_memory() in irqentry_enter() was
> supposed to be enough, but we have code in kmsan_unpoison_memory() (as
> well as other runtime functions) that checks for kmsan_in_runtime()
> and bails out to prevent potential recursion if KMSAN code starts
> calling itself.
>
> kmsan_in_runtime() is implemented as follows:
>
> ==============================================
> static __always_inline bool kmsan_in_runtime(void)
> {
>   if ((hardirq_count() >> HARDIRQ_SHIFT) > 1)
>     return true;
>   return kmsan_get_context()->kmsan_in_runtime;
> }
> ==============================================
> (see the code here:
> https://lore.kernel.org/lkml/20220426164315.625149-13-glider@google.com/#Z31mm:kmsan:kmsan.h)
>
> If we are running in the task context (in_task()==true),
> kmsan_get_context() returns a per-task `struct *kmsan_ctx`.
> If `in_task()==false` and `hardirq_count()>>HARDIRQ_SHIFT==1`, it
> returns a per-CPU one.
> Otherwise kmsan_in_runtime() is considered true to avoid dealing with
> nested interrupts.
>
> So in the case when `hardirq_count()>>HARDIRQ_SHIFT` is greater than
> 1, kmsan_in_runtime() becomes a no-op, which leads to false positives.
Should be "kmsan_unpoison_memory() becomes a no-op..."
Thomas Gleixner May 9, 2022, 7:09 p.m. UTC | #12
On Mon, May 09 2022 at 18:50, Alexander Potapenko wrote:
> Indeed, calling kmsan_unpoison_memory() in irqentry_enter() was
> supposed to be enough, but we have code in kmsan_unpoison_memory() (as
> well as other runtime functions) that checks for kmsan_in_runtime()
> and bails out to prevent potential recursion if KMSAN code starts
> calling itself.
>
> kmsan_in_runtime() is implemented as follows:
>
> ==============================================
> static __always_inline bool kmsan_in_runtime(void)
> {
>   if ((hardirq_count() >> HARDIRQ_SHIFT) > 1)
>     return true;
>   return kmsan_get_context()->kmsan_in_runtime;
> }
> ==============================================
> (see the code here:
> https://lore.kernel.org/lkml/20220426164315.625149-13-glider@google.com/#Z31mm:kmsan:kmsan.h)
>
> If we are running in the task context (in_task()==true),
> kmsan_get_context() returns a per-task `struct *kmsan_ctx`.
> If `in_task()==false` and `hardirq_count()>>HARDIRQ_SHIFT==1`, it
> returns a per-CPU one.
> Otherwise kmsan_in_runtime() is considered true to avoid dealing with
> nested interrupts.
>
> So in the case when `hardirq_count()>>HARDIRQ_SHIFT` is greater than
> 1, kmsan_in_runtime() becomes a no-op, which leads to false positives.

But, that'd only > 1 when there is a nested interrupt, which is not the
case. Interrupt handlers keep interrupts disabled. The last exception from
that rule was some legacy IDE driver which is gone by now.

So no, not a good explanation either.

Thanks,

        tglx
Alexander Potapenko May 12, 2022, 12:24 p.m. UTC | #13
On Mon, May 9, 2022 at 9:09 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> On Mon, May 09 2022 at 18:50, Alexander Potapenko wrote:
> > Indeed, calling kmsan_unpoison_memory() in irqentry_enter() was
> > supposed to be enough, but we have code in kmsan_unpoison_memory() (as
> > well as other runtime functions) that checks for kmsan_in_runtime()
> > and bails out to prevent potential recursion if KMSAN code starts
> > calling itself.
> >
> > kmsan_in_runtime() is implemented as follows:
> >
> > ==============================================
> > static __always_inline bool kmsan_in_runtime(void)
> > {
> >   if ((hardirq_count() >> HARDIRQ_SHIFT) > 1)
> >     return true;
> >   return kmsan_get_context()->kmsan_in_runtime;
> > }
> > ==============================================
> > (see the code here:
> > https://lore.kernel.org/lkml/20220426164315.625149-13-glider@google.com/#Z31mm:kmsan:kmsan.h)
> >
> > If we are running in the task context (in_task()==true),
> > kmsan_get_context() returns a per-task `struct *kmsan_ctx`.
> > If `in_task()==false` and `hardirq_count()>>HARDIRQ_SHIFT==1`, it
> > returns a per-CPU one.
> > Otherwise kmsan_in_runtime() is considered true to avoid dealing with
> > nested interrupts.
> >
> > So in the case when `hardirq_count()>>HARDIRQ_SHIFT` is greater than
> > 1, kmsan_in_runtime() becomes a no-op, which leads to false positives.
>
> But, that'd only > 1 when there is a nested interrupt, which is not the
> case. Interrupt handlers keep interrupts disabled. The last exception from
> that rule was some legacy IDE driver which is gone by now.

That's good to know, then we probably don't need this hardirq_count()
check anymore.

> So no, not a good explanation either.

After looking deeper I see that unpoisoning was indeed skipped because
kmsan_in_runtime() returned true, but I was wrong about the root
cause.
The problem was not caused by a nested hardirq, but rather by the fact
that the KMSAN hook in irqentry_enter() was called with in_task()==1.

Roughly said, T0 was running some code in the task context, then it
started executing KMSAN instrumentation and entered the runtime by
setting current->kmsan_ctx.kmsan_in_runtime.
Then an IRQ kicked in and started calling
asm_sysvec_apic_timer_interrupt() => sysvec_apic_timer_interrupt(regs)
=> irqentry_enter(regs) - but at that point in_task() was still true,
therefore kmsan_unpoison_memory() became a no-op.

As far as I can see, it is irq_enter_rcu() that makes in_task() return
0 by incrementing the preempt count in __irq_enter_raw(), so our
unpoisoning can only work if we perform it after we enter the IRQ
context.

I think the best that can be done here is (as suggested above) to
provide some kmsan_unpoison_pt_regs() function that will only be
called from the entry points and won't be doing reentrancy checks.
It should be safe, because unpoisoning boils down to calculating
shadow/origin addresses and calling memset() on them, no instrumented
code will be involved.
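
A rough sketch of what such a helper could look like (the name comes
from above; kmsan_enabled and kmsan_internal_unpoison_memory() are
assumed to be the runtime internals from this series, so the details
may differ):

==============================================
/* Assumes the declarations from mm/kmsan/kmsan.h. */
void kmsan_unpoison_pt_regs(struct pt_regs *regs)
{
	if (!kmsan_enabled)
		return;
	/*
	 * No kmsan_in_runtime() check: this only computes the shadow and
	 * origin addresses of @regs and memsets them, without calling
	 * back into instrumented code, so it cannot recurse.
	 */
	kmsan_internal_unpoison_memory(regs, sizeof(*regs), /*checked*/ true);
}
==============================================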

We could try to figure out the places in idtentry code where normal
kmsan_unpoison_memory() can be called in IRQ context, but as far as I
can see it will depend on the type of the entry point.

Another way to deal with the problem is to not rely on in_task(), but
rather use some per-cpu counter in irqentry_enter()/irqentry_exit() to
figure out whether we are in IRQ code already.
However this is only possible if irqentry_enter() itself guarantees that
the execution cannot be rescheduled to another CPU - is that the case?

> Thanks,
>
>         tglx
Thomas Gleixner May 12, 2022, 4:17 p.m. UTC | #14
On Thu, May 12 2022 at 14:24, Alexander Potapenko wrote:
> On Mon, May 9, 2022 at 9:09 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>> > So in the case when `hardirq_count()>>HARDIRQ_SHIFT` is greater than
>> > 1, kmsan_in_runtime() becomes a no-op, which leads to false positives.
>>
>> But, that'd only > 1 when there is a nested interrupt, which is not the
>> case. Interrupt handlers keep interrupts disabled. The last exception from
>> that rule was some legacy IDE driver which is gone by now.
>
> That's good to know, then we probably don't need this hardirq_count()
> check anymore.
>
>> So no, not a good explanation either.
>
> After looking deeper I see that unpoisoning was indeed skipped because
> kmsan_in_runtime() returned true, but I was wrong about the root
> cause.
> The problem was not caused by a nested hardirq, but rather by the fact
> that the KMSAN hook in irqentry_enter() was called with in_task()==1.

Argh, the preempt counter increment happens _after_ irqentry_enter().

> I think the best that can be done here is (as suggested above) to
> provide some kmsan_unpoison_pt_regs() function that will only be
> called from the entry points and won't be doing reentrancy checks.
> It should be safe, because unpoisoning boils down to calculating
> shadow/origin addresses and calling memset() on them, no instrumented
> code will be involved.

If you keep them where I placed them, then there is no need for a
noinstr function. It's already instrumentable.

> We could try to figure out the places in idtentry code where normal
> kmsan_unpoison_memory() can be called in IRQ context, but as far as I
> can see it will depend on the type of the entry point.

NMI is covered as it increments before it invokes the unpoison().

Let me figure out why we increment the preempt count late for
interrupts. IIRC it's for symmetry reasons related to softirq processing
on return, but let me double check.

> Another way to deal with the problem is to not rely on in_task(), but
> rather use some per-cpu counter in irqentry_enter()/irqentry_exit() to
> figure out whether we are in IRQ code already.

Well, if you have a irqentry() specific unpoison, then you know the
context, right?

> However this is only possible if irqentry_enter() itself guarantees that
> the execution cannot be rescheduled to another CPU - is that the case?

Obviously. It runs with interrupts disabled and eventually on a
separate interrupt stack.

Thanks,

        tglx
Thomas Gleixner May 12, 2022, 4:48 p.m. UTC | #15
On Thu, May 12 2022 at 18:17, Thomas Gleixner wrote:
> On Thu, May 12 2022 at 14:24, Alexander Potapenko wrote:
>> We could try to figure out the places in idtentry code where normal
>> kmsan_unpoison_memory() can be called in IRQ context, but as far as I
>> can see it will depend on the type of the entry point.
>
> NMI is covered as it increments before it invokes the unpoison().
>
> Let me figure out why we increment the preempt count late for
> interrupts. IIRC it's for symmetry reasons related to softirq processing
> on return, but let me double check.

It's even documented:

 https://www.kernel.org/doc/html/latest/core-api/entry.html#interrupts-and-regular-exceptions

But who reads documentation? :)

So, I think the simplest and least intrusive solution is to have special
purpose unpoison functions. See the patch below for illustration.

The reasons why I used specific ones:

  1) User entry

     Whether that's a syscall or interrupt/exception does not
     matter. It's always on the task stack and your machinery cannot be
     running at that point because it came from user space.

  2) Interrupt/exception/NMI entry kernel
  
     Those can nest into an already active context, so you really want
     to unpoison @regs.

     Also while regular interrupts cannot nest because of interrupts
     staying disabled, exceptions triggered in the interrupt handler and
     NMIs can nest.

     -> device interrupt()
           irqentry_enter(regs)

        -> NMI()
           irqentry_nmi_enter(regs)

           -> fault()
              irqentry_enter(regs)
          
              --> debug_exception()
                  irqentry_nmi_enter(regs)

     Soft interrupt processing on return from interrupt makes it more
     interesting:

     interrupt()
       handler()
       do_softirq()
         local_irq_enable()
            interrupt()
              NMI
                ....

     And everytime you get a new @regs pointer to deal with.

Wonderful, isn't it?

Thanks,

        tglx

---
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -24,6 +24,7 @@ static __always_inline void __enter_from
 	user_exit_irqoff();
 
 	instrumentation_begin();
+	unpoison_user(regs);
 	trace_hardirqs_off_finish();
 	instrumentation_end();
 }
@@ -352,6 +353,7 @@ noinstr irqentry_state_t irqentry_enter(
 		lockdep_hardirqs_off(CALLER_ADDR0);
 		rcu_irq_enter();
 		instrumentation_begin();
+		unpoison_irq(regs);
 		trace_hardirqs_off_finish();
 		instrumentation_end();
 
@@ -367,6 +369,7 @@ noinstr irqentry_state_t irqentry_enter(
 	 */
 	lockdep_hardirqs_off(CALLER_ADDR0);
 	instrumentation_begin();
+	unpoison_irq(regs);
 	rcu_irq_enter_check_tick();
 	trace_hardirqs_off_finish();
 	instrumentation_end();
@@ -452,6 +455,7 @@ irqentry_state_t noinstr irqentry_nmi_en
 	rcu_nmi_enter();
 
 	instrumentation_begin();
+	unpoison_irq(regs);
 	trace_hardirqs_off_finish();
 	ftrace_nmi_enter();
 	instrumentation_end();
Alexander Potapenko June 1, 2022, 11:27 a.m. UTC | #16
On Thu, May 12, 2022 at 6:48 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> On Thu, May 12 2022 at 18:17, Thomas Gleixner wrote:
> > On Thu, May 12 2022 at 14:24, Alexander Potapenko wrote:
> >> We could try to figure out the places in idtentry code where normal
> >> kmsan_unpoison_memory() can be called in IRQ context, but as far as I
> >> can see it will depend on the type of the entry point.
> >
> > NMI is covered as it increments before it invokes the unpoison().
> >
> > Let me figure out why we increment the preempt count late for
> > interrupts. IIRC it's for symmetry reasons related to softirq processing
> > on return, but let me double check.
>
> It's even documented:
>
>  https://www.kernel.org/doc/html/latest/core-api/entry.html#interrupts-and-regular-exceptions
>
> But who reads documentation? :)
>
> So, I think the simplest and least intrusive solution is to have special
> purpose unpoison functions. See the patch below for illustration.

This patch works well and I am going to adopt it for my series.
But the problem with occasional calls of instrumented functions from
noinstr still persists: if there is a noinstr function foo() and an
instrumented function bar() called from foo() with one or more
arguments, bar() must wipe its kmsan_context_state before using the
arguments.

I have a solution for this problem described in https://reviews.llvm.org/D126385
The plan is to pass __builtin_return_address(0) to
__msan_get_context_state_caller() at the beginning of each
instrumented function.
Then KMSAN runtime can check the passed return address and wipe the
context if it belongs to the .noinstr code section.
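
Roughly, the runtime side could look like the sketch below
(assumptions: __noinstr_text_start/__noinstr_text_end are the linker
symbols bounding the .noinstr.text section, and the context state is
reachable from kmsan_get_context() via a hypothetical cstate field):

==============================================
extern char __noinstr_text_start[], __noinstr_text_end[];

struct kmsan_context_state *__msan_get_context_state_caller(void *caller)
{
	struct kmsan_context_state *cs = &kmsan_get_context()->cstate;
	unsigned long addr = (unsigned long)caller;

	/*
	 * A caller in .noinstr.text did not set up the metadata for our
	 * arguments, so wipe the stale state before it is consumed.
	 */
	if (addr >= (unsigned long)__noinstr_text_start &&
	    addr < (unsigned long)__noinstr_text_end)
		memset(cs, 0, sizeof(*cs));
	return cs;
}
==============================================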

Alternatively, we could employ MSan's -fsanitize-memory-param-retval
flag, which will report supplying uninitialized parameters when calling
functions.
Doing so is currently allowed in the kernel, but Clang aggressively
applies the noundef attribute (see https://llvm.org/docs/LangRef.html)
to function arguments, which effectively makes passing uninit values
as function parameters undefined behavior.
So if we make KMSAN detect such cases as well, we can ultimately get
rid of all cases when uninits are passed to functions.
As a result, kmsan_context_state will become unnecessary, because it
will never contain nonzero values.


> The reasons why I used specific ones:
>
>   1) User entry
>
>      Whether that's a syscall or interrupt/exception does not
>      matter. It's always on the task stack and your machinery cannot be
>      running at that point because it came from user space.
>
>   2) Interrupt/exception/NMI entry kernel
>
>      Those can nest into an already active context, so you really want
>      to unpoison @regs.
>
>      Also while regular interrupts cannot nest because of interrupts
>      staying disabled, exceptions triggered in the interrupt handler and
>      NMIs can nest.
>
>      -> device interrupt()
>            irqentry_enter(regs)
>
>         -> NMI()
>            irqentry_nmi_enter(regs)
>
>            -> fault()
>               irqentry_enter(regs)
>
>               --> debug_exception()
>                   irqentry_nmi_enter(regs)
>
>      Soft interrupt processing on return from interrupt makes it more
>      interesting:
>
>      interrupt()
>        handler()
>        do_softirq()
>          local_irq_enable()
>             interrupt()
>               NMI
>                 ....
>
>      And everytime you get a new @regs pointer to deal with.
>
> Wonderful, isn't it?
>
> Thanks,
>
>         tglx
>
diff mbox series

Patch

diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index 93c3b86e781c1..ce2324374882c 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -23,7 +23,7 @@  static __always_inline void __enter_from_user_mode(struct pt_regs *regs)
 	CT_WARN_ON(ct_state() != CONTEXT_USER);
 	user_exit_irqoff();
 
-	instrumentation_begin();
+	instrumentation_begin_with_regs(regs);
 	trace_hardirqs_off_finish();
 	instrumentation_end();
 }
@@ -105,7 +105,7 @@  noinstr long syscall_enter_from_user_mode(struct pt_regs *regs, long syscall)
 
 	__enter_from_user_mode(regs);
 
-	instrumentation_begin();
+	instrumentation_begin_with_regs(regs);
 	local_irq_enable();
 	ret = __syscall_enter_from_user_work(regs, syscall);
 	instrumentation_end();
@@ -116,7 +116,7 @@  noinstr long syscall_enter_from_user_mode(struct pt_regs *regs, long syscall)
 noinstr void syscall_enter_from_user_mode_prepare(struct pt_regs *regs)
 {
 	__enter_from_user_mode(regs);
-	instrumentation_begin();
+	instrumentation_begin_with_regs(regs);
 	local_irq_enable();
 	instrumentation_end();
 }
@@ -290,7 +290,7 @@  void syscall_exit_to_user_mode_work(struct pt_regs *regs)
 
 __visible noinstr void syscall_exit_to_user_mode(struct pt_regs *regs)
 {
-	instrumentation_begin();
+	instrumentation_begin_with_regs(regs);
 	__syscall_exit_to_user_mode_work(regs);
 	instrumentation_end();
 	__exit_to_user_mode();
@@ -303,7 +303,7 @@  noinstr void irqentry_enter_from_user_mode(struct pt_regs *regs)
 
 noinstr void irqentry_exit_to_user_mode(struct pt_regs *regs)
 {
-	instrumentation_begin();
+	instrumentation_begin_with_regs(regs);
 	exit_to_user_mode_prepare(regs);
 	instrumentation_end();
 	__exit_to_user_mode();
@@ -351,7 +351,7 @@  noinstr irqentry_state_t irqentry_enter(struct pt_regs *regs)
 		 */
 		lockdep_hardirqs_off(CALLER_ADDR0);
 		rcu_irq_enter();
-		instrumentation_begin();
+		instrumentation_begin_with_regs(regs);
 		trace_hardirqs_off_finish();
 		instrumentation_end();
 
@@ -366,7 +366,7 @@  noinstr irqentry_state_t irqentry_enter(struct pt_regs *regs)
 	 * in having another one here.
 	 */
 	lockdep_hardirqs_off(CALLER_ADDR0);
-	instrumentation_begin();
+	instrumentation_begin_with_regs(regs);
 	rcu_irq_enter_check_tick();
 	trace_hardirqs_off_finish();
 	instrumentation_end();
@@ -413,7 +413,7 @@  noinstr void irqentry_exit(struct pt_regs *regs, irqentry_state_t state)
 		 * and RCU as the return to user mode path.
 		 */
 		if (state.exit_rcu) {
-			instrumentation_begin();
+			instrumentation_begin_with_regs(regs);
 			/* Tell the tracer that IRET will enable interrupts */
 			trace_hardirqs_on_prepare();
 			lockdep_hardirqs_on_prepare(CALLER_ADDR0);
@@ -423,7 +423,7 @@  noinstr void irqentry_exit(struct pt_regs *regs, irqentry_state_t state)
 			return;
 		}
 
-		instrumentation_begin();
+		instrumentation_begin_with_regs(regs);
 		if (IS_ENABLED(CONFIG_PREEMPTION))
 			irqentry_exit_cond_resched();
 
@@ -451,7 +451,7 @@  irqentry_state_t noinstr irqentry_nmi_enter(struct pt_regs *regs)
 	lockdep_hardirq_enter();
 	rcu_nmi_enter();
 
-	instrumentation_begin();
+	instrumentation_begin_with_regs(regs);
 	trace_hardirqs_off_finish();
 	ftrace_nmi_enter();
 	instrumentation_end();
@@ -461,7 +461,7 @@  irqentry_state_t noinstr irqentry_nmi_enter(struct pt_regs *regs)
 
 void noinstr irqentry_nmi_exit(struct pt_regs *regs, irqentry_state_t irq_state)
 {
-	instrumentation_begin();
+	instrumentation_begin_with_regs(regs);
 	ftrace_nmi_exit();
 	if (irq_state.lockdep) {
 		trace_hardirqs_on_prepare();