diff mbox series

[v2] arm64: Enable KCSAN

Message ID 20211129145732.27000-1-wangkefeng.wang@huawei.com (mailing list archive)
State New, archived
Headers show
Series [v2] arm64: Enable KCSAN | expand

Commit Message

Kefeng Wang Nov. 29, 2021, 2:57 p.m. UTC
This patch enables KCSAN for arm64, with updates to build rules
to not use KCSAN for several incompatible compilation units.

Resent GCC version(at least GCC10) made outline-atomics as the
default option(unlike Clang), which will cause linker errors
for kernel/kcsan/core.o.

Disables the out-of-line atomics by no-outline-atomics to fix
the linker errors.

Tested selftest and kcsan_test(built with GCC11 and Clang 13),
and all passed.

Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
---
Tested on Qemu with clang 13 / gcc 11, based on 5.16-rc3.

[    0.221518] kcsan: enabled early
[    0.222422] kcsan: strict mode configured
...
[    5.839223] kcsan: selftest: 3/3 tests passed
...
[  517.895102] # kcsan: pass:24 fail:0 skip:0 total:24
[  517.896393] # Totals: pass:168 fail:0 skip:0 total:168
[  517.897502] ok 1 - kcsan

v2:
- tested on GCC11 and disable outline-atomics for kernel/kcsan/core.c
  suggested by Marco Elver

 arch/arm64/Kconfig               | 1 +
 arch/arm64/kernel/vdso/Makefile  | 1 +
 arch/arm64/kvm/hyp/nvhe/Makefile | 1 +
 kernel/kcsan/Makefile            | 1 +
 4 files changed, 4 insertions(+)

Comments

Marco Elver Dec. 1, 2021, 10:17 a.m. UTC | #1
On Mon, 29 Nov 2021 at 15:46, Kefeng Wang <wangkefeng.wang@huawei.com> wrote:
> This patch enables KCSAN for arm64, with updates to build rules
> to not use KCSAN for several incompatible compilation units.
>
> Resent GCC version(at least GCC10) made outline-atomics as the
> default option(unlike Clang), which will cause linker errors
> for kernel/kcsan/core.o.
>
> Disables the out-of-line atomics by no-outline-atomics to fix
> the linker errors.
>
> Tested selftest and kcsan_test(built with GCC11 and Clang 13),
> and all passed.
>
> Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>

Acked-by: Marco Elver <elver@google.com> # kernel/kcsan

However, you need to wait for Mark to respond, because he'd know best
what might be missing.
This thread might give some clues:
https://groups.google.com/g/kasan-dev/c/4ySdJfedzso/m/e4CzdfWWBQAJ
(can't find LKML copy)

Thanks,
-- Marco
Mark Rutland Dec. 1, 2021, 11:53 a.m. UTC | #2
Hi Kefeng,

On Mon, Nov 29, 2021 at 10:57:32PM +0800, Kefeng Wang wrote:
> This patch enables KCSAN for arm64, with updates to build rules
> to not use KCSAN for several incompatible compilation units.
> 
> Resent GCC version(at least GCC10) made outline-atomics as the
> default option(unlike Clang), which will cause linker errors
> for kernel/kcsan/core.o.
> 
> Disables the out-of-line atomics by no-outline-atomics to fix
> the linker errors.
> 
> Tested selftest and kcsan_test(built with GCC11 and Clang 13),
> and all passed.

Nice!

I think there are a few additional bits and pieces we'll need:

* Prior to clang 12.0.0, KCSAN would produce warnings with BTI, as I found in:

  https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/commit/?h=arm64/kcsan&id=2d67c39ae4f619ca94d9790e09186e77922fa826

  Since BTI is in defconfig, I think arm64's Kconfig should require a minimum
  of clang 12.0.0 to enable KCSAN.

* In the past clang did not have an attribute to suppress tsan instrumenation
  and would instrument noinstr regions. I'm not sure when clang gained the
  relevant attribute to supress this, but we will need to depend on this
  existing, either based on the clang version or with a test for the attribute.

  (If we're lucky, clang 12.0.0 is sufficient, and we solve BTI and this in one
  go).

  I *think* GCC always had an attribute, but I'm not certain.

  Marco, is there an existing dependency somewhere for this to work on x86? I
  thought there was an objtool pass to NOP this out, but I couldn't find it in
  mainline. If x86 is implicitly depending on a sufficiently recent version of
  clang, we add something to the common KCSAN Kconfig for ARCH_WANTS_NO_INSTR?

* There are some latent issues with some code (e.g. alternatives, patching, insn)
  code being instrumentable even though this is unsound, and depending on
  compiler choices this can happen to be fine or can result in boot-time
  failures (I saw lockups when I started trying to add KCSAN for arm64).

  While this isn't just a KCSAN problem, fixing that requires some fairly
  significant rework to a bunch of code, and until that's done we're on very
  shaky ground. So I'd like to make KCSAN depend on EXPERT for now.

  I had an initial stab at fixing some of that, e.g.

  https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/log/?h=arm64/patching/rework
  
  Joey has started looking into this too.

* When I last tested, for simple boots I would get frequent KCSAN splats for a
  few common issues, and those drowned out all other reports.

  One case was manipulation of thread_info::flags, which Thomas Gleixner has
  queued some fixes at:

  https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/log/?h=core/entry
 
  There were some other common failures, e.g. accesses to task_struct::on_cpu,
  and I hadn't had the chance to investigate/fix those, beyond a (likely
  unsound) hack:

  https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/commit/?h=arm64/kcsan&id=4fe9d6c2ef85257d80291086e4514eaaebd3504e

  It would be good if we could identify the most frequent problems (e.g. those
  that will occur when just booting) before we enable this generally, to avoid
  everyone racing to report/fix those as soon as we enable the feature.

  When you tested, did KCSAN flag anything beyond the selftests?

> 
> Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
> ---
> Tested on Qemu with clang 13 / gcc 11, based on 5.16-rc3.
> 
> [    0.221518] kcsan: enabled early
> [    0.222422] kcsan: strict mode configured
> ...
> [    5.839223] kcsan: selftest: 3/3 tests passed
> ...
> [  517.895102] # kcsan: pass:24 fail:0 skip:0 total:24
> [  517.896393] # Totals: pass:168 fail:0 skip:0 total:168
> [  517.897502] ok 1 - kcsan
> 
> v2:
> - tested on GCC11 and disable outline-atomics for kernel/kcsan/core.c
>   suggested by Marco Elver
> 
>  arch/arm64/Kconfig               | 1 +
>  arch/arm64/kernel/vdso/Makefile  | 1 +
>  arch/arm64/kvm/hyp/nvhe/Makefile | 1 +
>  kernel/kcsan/Makefile            | 1 +
>  4 files changed, 4 insertions(+)

Aside from the `-mno-outline-atomics` portion, this looks basically the same as what I had in:

https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/commit/?h=arm64/kcsan&id=2d67c39ae4f619ca94d9790e09186e77922fa826

... so this looks good to me modulo the comments above.

Thanks,
Mark.

> 
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index 4ff73299f8a9..0ac90875f71d 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -150,6 +150,7 @@ config ARM64
>  	select HAVE_ARCH_KASAN_VMALLOC if HAVE_ARCH_KASAN
>  	select HAVE_ARCH_KASAN_SW_TAGS if HAVE_ARCH_KASAN
>  	select HAVE_ARCH_KASAN_HW_TAGS if (HAVE_ARCH_KASAN && ARM64_MTE)
> +	select HAVE_ARCH_KCSAN
>  	select HAVE_ARCH_KFENCE
>  	select HAVE_ARCH_KGDB
>  	select HAVE_ARCH_MMAP_RND_BITS
> diff --git a/arch/arm64/kernel/vdso/Makefile b/arch/arm64/kernel/vdso/Makefile
> index 700767dfd221..60813497a381 100644
> --- a/arch/arm64/kernel/vdso/Makefile
> +++ b/arch/arm64/kernel/vdso/Makefile
> @@ -32,6 +32,7 @@ ccflags-y += -DDISABLE_BRANCH_PROFILING -DBUILD_VDSO
>  CFLAGS_REMOVE_vgettimeofday.o = $(CC_FLAGS_FTRACE) -Os $(CC_FLAGS_SCS) $(GCC_PLUGINS_CFLAGS) \
>  				$(CC_FLAGS_LTO)
>  KASAN_SANITIZE			:= n
> +KCSAN_SANITIZE			:= n
>  UBSAN_SANITIZE			:= n
>  OBJECT_FILES_NON_STANDARD	:= y
>  KCOV_INSTRUMENT			:= n
> diff --git a/arch/arm64/kvm/hyp/nvhe/Makefile b/arch/arm64/kvm/hyp/nvhe/Makefile
> index c3c11974fa3b..24b2c2425b38 100644
> --- a/arch/arm64/kvm/hyp/nvhe/Makefile
> +++ b/arch/arm64/kvm/hyp/nvhe/Makefile
> @@ -89,6 +89,7 @@ KBUILD_CFLAGS := $(filter-out $(CC_FLAGS_FTRACE) $(CC_FLAGS_SCS) $(CC_FLAGS_CFI)
>  # cause crashes. Just disable it.
>  GCOV_PROFILE	:= n
>  KASAN_SANITIZE	:= n
> +KCSAN_SANITIZE	:= n
>  UBSAN_SANITIZE	:= n
>  KCOV_INSTRUMENT	:= n
>  
> diff --git a/kernel/kcsan/Makefile b/kernel/kcsan/Makefile
> index c2bb07f5bcc7..e893b0e1d62a 100644
> --- a/kernel/kcsan/Makefile
> +++ b/kernel/kcsan/Makefile
> @@ -8,6 +8,7 @@ CFLAGS_REMOVE_debugfs.o = $(CC_FLAGS_FTRACE)
>  CFLAGS_REMOVE_report.o = $(CC_FLAGS_FTRACE)
>  
>  CFLAGS_core.o := $(call cc-option,-fno-conserve-stack) \
> +	$(call cc-option,-mno-outline-atomics) \
>  	-fno-stack-protector -DDISABLE_BRANCH_PROFILING
>  
>  obj-y := core.o debugfs.o report.o
> -- 
> 2.26.2
>
Marco Elver Dec. 1, 2021, 12:25 p.m. UTC | #3
On Wed, Dec 01, 2021 at 11:53AM +0000, Mark Rutland wrote:
[...]
> * In the past clang did not have an attribute to suppress tsan instrumenation
>   and would instrument noinstr regions. I'm not sure when clang gained the
>   relevant attribute to supress this, but we will need to depend on this
>   existing, either based on the clang version or with a test for the attribute.
> 
>   (If we're lucky, clang 12.0.0 is sufficient, and we solve BTI and this in one
>   go).
> 
>   I *think* GCC always had an attribute, but I'm not certain.
> 
>   Marco, is there an existing dependency somewhere for this to work on x86? I
>   thought there was an objtool pass to NOP this out, but I couldn't find it in
>   mainline. If x86 is implicitly depending on a sufficiently recent version of
>   clang, we add something to the common KCSAN Kconfig for ARCH_WANTS_NO_INSTR?

Apologies for the confusion w.r.t. attributes and which sanitizers on
which compilers respect which attributes. I think you may be confusing
things with KCOV (see 540540d06e9d9). I think the various 'select
ARCH_HAS_KCOV' may need adjusting, that is true.

But KCOV != KCSAN, and for KCSAN the story is different. Since the first
version of KCSAN in 5.8, we've had a working __no_kcsan (aka
__no_sanitize_thread) with all versions of Clang and GCC that support
KCSAN (see HAVE_KCSAN_COMPILER).

The recent discussion was for CONFIG_KCSAN_WEAK_MEMORY [1], because
Clang's no_sanitize("thread") would still instrument builtin atomics and
__tsan_func_{entry,exit}, which only that mode would start inserting
instrumentation for. That's not in mainline yet.

[1] https://lkml.kernel.org/r/20211130114433.2580590-1-elver@google.com

It is true that v1 and v2 of that series would have caused issues on
arm64, but after our discussion last week and tried a little harder and
v3 does the right thing for all architectures now and __no_kcsan will
disable all instrumentation (even for barriers).

So the attribute and noinstr story should not need anything else, and
the new KCSAN_WEAK_MEMORY has all right Kconfig dependencies in place
when it lands in mainline.

> * There are some latent issues with some code (e.g. alternatives, patching, insn)
>   code being instrumentable even though this is unsound, and depending on
>   compiler choices this can happen to be fine or can result in boot-time
>   failures (I saw lockups when I started trying to add KCSAN for arm64).
> 
>   While this isn't just a KCSAN problem, fixing that requires some fairly
>   significant rework to a bunch of code, and until that's done we're on very
>   shaky ground. So I'd like to make KCSAN depend on EXPERT for now.

I take it you mean arm64 should do 'select HAVE_ARCH_KCSAN if EXPERT'. I
certainly don't mind, but usually folks interested in running debug
tools won't be stopped by a dependency on EXPERT. You could do 'select
HAVE_ARCH_KCSAN if BROKEN' which is more likely to prevent usage while
things are still likely to be broken on arm64.

Thanks,
-- Marco
Mark Rutland Dec. 1, 2021, 3:05 p.m. UTC | #4
On Wed, Dec 01, 2021 at 01:25:22PM +0100, Marco Elver wrote:
> On Wed, Dec 01, 2021 at 11:53AM +0000, Mark Rutland wrote:
> [...]
> > * In the past clang did not have an attribute to suppress tsan instrumenation
> >   and would instrument noinstr regions. I'm not sure when clang gained the
> >   relevant attribute to supress this, but we will need to depend on this
> >   existing, either based on the clang version or with a test for the attribute.
> > 
> >   (If we're lucky, clang 12.0.0 is sufficient, and we solve BTI and this in one
> >   go).
> > 
> >   I *think* GCC always had an attribute, but I'm not certain.
> > 
> >   Marco, is there an existing dependency somewhere for this to work on x86? I
> >   thought there was an objtool pass to NOP this out, but I couldn't find it in
> >   mainline. If x86 is implicitly depending on a sufficiently recent version of
> >   clang, we add something to the common KCSAN Kconfig for ARCH_WANTS_NO_INSTR?
> 
> Apologies for the confusion w.r.t. attributes and which sanitizers on
> which compilers respect which attributes. I think you may be confusing
> things with KCOV (see 540540d06e9d9). I think the various 'select
> ARCH_HAS_KCOV' may need adjusting, that is true.
> 
> But KCOV != KCSAN, and for KCSAN the story is different. Since the first
> version of KCSAN in 5.8, we've had a working __no_kcsan (aka
> __no_sanitize_thread) with all versions of Clang and GCC that support
> KCSAN (see HAVE_KCSAN_COMPILER).

My memory was hazy; I recalled that KCOV was broken in that way, and thought
KCSAN was similarly affected because of the objtool patches.

I'm glad to hear that's not the case!

Since I'm a bit paranoid about this, I'd still like to check that the supported
version (clang 12+, and whatever GCC versions are relevant), but that should
just be a matter of eyeballing the object code, and I'm happy to go do that as
a review step.

> The recent discussion was for CONFIG_KCSAN_WEAK_MEMORY [1], because
> Clang's no_sanitize("thread") would still instrument builtin atomics and
> __tsan_func_{entry,exit}, which only that mode would start inserting
> instrumentation for. That's not in mainline yet.
> 
> [1] https://lkml.kernel.org/r/20211130114433.2580590-1-elver@google.com
> 
> It is true that v1 and v2 of that series would have caused issues on
> arm64, but after our discussion last week and tried a little harder and
> v3 does the right thing for all architectures now and __no_kcsan will
> disable all instrumentation (even for barriers).

Nice; thanks for attacking that! :)

> So the attribute and noinstr story should not need anything else, and
> the new KCSAN_WEAK_MEMORY has all right Kconfig dependencies in place
> when it lands in mainline.
> 
> > * There are some latent issues with some code (e.g. alternatives, patching, insn)
> >   code being instrumentable even though this is unsound, and depending on
> >   compiler choices this can happen to be fine or can result in boot-time
> >   failures (I saw lockups when I started trying to add KCSAN for arm64).
> > 
> >   While this isn't just a KCSAN problem, fixing that requires some fairly
> >   significant rework to a bunch of code, and until that's done we're on very
> >   shaky ground. So I'd like to make KCSAN depend on EXPERT for now.
> 
> I take it you mean arm64 should do 'select HAVE_ARCH_KCSAN if EXPERT'. I
> certainly don't mind, but usually folks interested in running debug
> tools won't be stopped by a dependency on EXPERT. You could do 'select
> HAVE_ARCH_KCSAN if BROKEN' which is more likely to prevent usage while
> things are still likely to be broken on arm64.

TBH, I've gone back-and-forth on this and I'm also happy to go without if
people don't like it.

I had wanted this because I had previously hit issues, but the problem is
admittedly not unique to KCSAN. If we do encounter boot time failures in
practice we can mark it as BROKEN and go from there.

Maybe it would be sufficient to have some Kconfig text noting there are
currently some potential soundness issues with instrumentation, just to give
people a heads-up. The EXPERT dependency was just a way to make people think
twice.

Thanks,
Mark
Kefeng Wang Dec. 2, 2021, 1:35 a.m. UTC | #5
On 2021/12/1 19:53, Mark Rutland wrote:
> Hi Kefeng,
>
> On Mon, Nov 29, 2021 at 10:57:32PM +0800, Kefeng Wang wrote:
>> This patch enables KCSAN for arm64, with updates to build rules
>> to not use KCSAN for several incompatible compilation units.
>>
>> Resent GCC version(at least GCC10) made outline-atomics as the
>> default option(unlike Clang), which will cause linker errors
>> for kernel/kcsan/core.o.
>>
>> Disables the out-of-line atomics by no-outline-atomics to fix
>> the linker errors.
>>
>> Tested selftest and kcsan_test(built with GCC11 and Clang 13),
>> and all passed.
> Nice!
>
> I think there are a few additional bits and pieces we'll need:
>
> * Prior to clang 12.0.0, KCSAN would produce warnings with BTI, as I found in:
>
>    https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/commit/?h=arm64/kcsan&id=2d67c39ae4f619ca94d9790e09186e77922fa826
>
>    Since BTI is in defconfig, I think arm64's Kconfig should require a minimum
>    of clang 12.0.0 to enable KCSAN.

I don't have different clang version to test,  when check KCSAN,

commit eb32f9f990d9 ("kcsan: Improve some Kconfig comments") saids,


     The compiler instruments plain compound read-write operations
     differently (++, --, +=, -=, |=, &=, etc.), which allows KCSAN to
     distinguish them from other plain accesses. This is currently
     supported by Clang 12 or later.

Should we add a  "depends on CLANG_VERSION >= 120000"


>
> * In the past clang did not have an attribute to suppress tsan instrumenation
>    and would instrument noinstr regions. I'm not sure when clang gained the
>    relevant attribute to supress this, but we will need to depend on this
>    existing, either based on the clang version or with a test for the attribute.
>
>    (If we're lucky, clang 12.0.0 is sufficient, and we solve BTI and this in one
>    go).
>
>    I *think* GCC always had an attribute, but I'm not certain.
>
>    Marco, is there an existing dependency somewhere for this to work on x86? I
>    thought there was an objtool pass to NOP this out, but I couldn't find it in
>    mainline. If x86 is implicitly depending on a sufficiently recent version of
>    clang, we add something to the common KCSAN Kconfig for ARCH_WANTS_NO_INSTR?
>
> * There are some latent issues with some code (e.g. alternatives, patching, insn)
>    code being instrumentable even though this is unsound, and depending on
>    compiler choices this can happen to be fine or can result in boot-time
>    failures (I saw lockups when I started trying to add KCSAN for arm64).
>
>    While this isn't just a KCSAN problem, fixing that requires some fairly
>    significant rework to a bunch of code, and until that's done we're on very
>    shaky ground. So I'd like to make KCSAN depend on EXPERT for now.
>
>    I had an initial stab at fixing some of that, e.g.
>
>    https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/log/?h=arm64/patching/rework
>    
>    Joey has started looking into this too.

Thanks for your information,  I don't know about this. As your say, we 
could add a depend on EXPERT

for now and more explanation into changlog.

>
> * When I last tested, for simple boots I would get frequent KCSAN splats for a
>    few common issues, and those drowned out all other reports.
>
>    One case was manipulation of thread_info::flags, which Thomas Gleixner has
>    queued some fixes at:
>
>    https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/log/?h=core/entry
>   
>    There were some other common failures, e.g. accesses to task_struct::on_cpu,
>    and I hadn't had the chance to investigate/fix those, beyond a (likely
>    unsound) hack:
>
>    https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/commit/?h=arm64/kcsan&id=4fe9d6c2ef85257d80291086e4514eaaebd3504e
>
>    It would be good if we could identify the most frequent problems (e.g. those
>    that will occur when just booting) before we enable this generally, to avoid
>    everyone racing to report/fix those as soon as we enable the feature.
>
>    When you tested, did KCSAN flag anything beyond the selftests?

Yes, there are some KCSAN reports, but this is not only exist on arm64, 
I saw  owner->on_cpu warning

on x86 too, eg, we also hack to disable it via data_race.

Reported by Kernel Concurrency Sanitizer on:
CPU: 7 PID: 2530 Comm: syz-executor.11 Not tainted 5.10.0+ #113
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1.104/01/2014
==================================================================
BUG: KCSAN: data-race in rwsem_spin_on_owner+0xf4/0x180

race at unknown origin, with read to 0xffff9767d3becfac of 4 bytes by task 18119 on cpu 0:
  rwsem_spin_on_owner+0xf4/0x180
  rwsem_optimistic_spin+0x48/0x480
  rwsem_down_read_slowpath+0x4a0/0x670
  down_read+0x69/0x190
  process_vm_rw+0x41e/0x840
  __x64_sys_process_vm_writev+0x76/0x90
  do_syscall_64+0x37/0x50
  entry_SYSCALL_64_after_hwframe+0x44/0xa9

>
>> Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
>> ---
>> Tested on Qemu with clang 13 / gcc 11, based on 5.16-rc3.
>>
>> [    0.221518] kcsan: enabled early
>> [    0.222422] kcsan: strict mode configured
>> ...
>> [    5.839223] kcsan: selftest: 3/3 tests passed
>> ...
>> [  517.895102] # kcsan: pass:24 fail:0 skip:0 total:24
>> [  517.896393] # Totals: pass:168 fail:0 skip:0 total:168
>> [  517.897502] ok 1 - kcsan
>>
>> v2:
>> - tested on GCC11 and disable outline-atomics for kernel/kcsan/core.c
>>    suggested by Marco Elver
>>
>>   arch/arm64/Kconfig               | 1 +
>>   arch/arm64/kernel/vdso/Makefile  | 1 +
>>   arch/arm64/kvm/hyp/nvhe/Makefile | 1 +
>>   kernel/kcsan/Makefile            | 1 +
>>   4 files changed, 4 insertions(+)
> Aside from the `-mno-outline-atomics` portion, this looks basically the same as what I had in:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/commit/?h=arm64/kcsan&id=2d67c39ae4f619ca94d9790e09186e77922fa826
>
> ... so this looks good to me modulo the comments above.
>
> Thanks,
> Mark.
Marco Elver Dec. 2, 2021, 10:15 a.m. UTC | #6
On Thu, Dec 02, 2021 at 09:35AM +0800, Kefeng Wang wrote:
> 
> On 2021/12/1 19:53, Mark Rutland wrote:
> > Hi Kefeng,
> > 
> > On Mon, Nov 29, 2021 at 10:57:32PM +0800, Kefeng Wang wrote:
> > > This patch enables KCSAN for arm64, with updates to build rules
> > > to not use KCSAN for several incompatible compilation units.
> > > 
> > > Resent GCC version(at least GCC10) made outline-atomics as the
> > > default option(unlike Clang), which will cause linker errors
> > > for kernel/kcsan/core.o.
> > > 
> > > Disables the out-of-line atomics by no-outline-atomics to fix
> > > the linker errors.
> > > 
> > > Tested selftest and kcsan_test(built with GCC11 and Clang 13),
> > > and all passed.
> > Nice!
> > 
> > I think there are a few additional bits and pieces we'll need:
> > 
> > * Prior to clang 12.0.0, KCSAN would produce warnings with BTI, as I found in:
> > 
> >    https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/commit/?h=arm64/kcsan&id=2d67c39ae4f619ca94d9790e09186e77922fa826
> > 
> >    Since BTI is in defconfig, I think arm64's Kconfig should require a minimum
> >    of clang 12.0.0 to enable KCSAN.
> 
> I don't have different clang version to test,  when check KCSAN,
> 
> commit eb32f9f990d9 ("kcsan: Improve some Kconfig comments") saids,
> 
> 
>     The compiler instruments plain compound read-write operations
>     differently (++, --, +=, -=, |=, &=, etc.), which allows KCSAN to
>     distinguish them from other plain accesses. This is currently
>     supported by Clang 12 or later.
> 
> Should we add a  "depends on CLANG_VERSION >= 120000"

KCSAN works just fine with Clang 11. Clang 12 merely improves some
instrumentation, which is what this comment is about.

What Mark meant is that there's a specific issue with arm64 and BTI that
is fixed by Clang 12. Therefore, arm64's Kconfig will have to do

	select HAVE_ARCH_KCSAN if CC_IS_GCC || CLANG_VERSION >= 120000

> > 
> > * In the past clang did not have an attribute to suppress tsan instrumenation
> >    and would instrument noinstr regions. I'm not sure when clang gained the
> >    relevant attribute to supress this, but we will need to depend on this
> >    existing, either based on the clang version or with a test for the attribute.
> > 
> >    (If we're lucky, clang 12.0.0 is sufficient, and we solve BTI and this in one
> >    go).
> > 
> >    I *think* GCC always had an attribute, but I'm not certain.
> > 
> >    Marco, is there an existing dependency somewhere for this to work on x86? I
> >    thought there was an objtool pass to NOP this out, but I couldn't find it in
> >    mainline. If x86 is implicitly depending on a sufficiently recent version of
> >    clang, we add something to the common KCSAN Kconfig for ARCH_WANTS_NO_INSTR?
> > 
> > * There are some latent issues with some code (e.g. alternatives, patching, insn)
> >    code being instrumentable even though this is unsound, and depending on
> >    compiler choices this can happen to be fine or can result in boot-time
> >    failures (I saw lockups when I started trying to add KCSAN for arm64).
> > 
> >    While this isn't just a KCSAN problem, fixing that requires some fairly
> >    significant rework to a bunch of code, and until that's done we're on very
> >    shaky ground. So I'd like to make KCSAN depend on EXPERT for now.
> > 
> >    I had an initial stab at fixing some of that, e.g.
> > 
> >    https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/log/?h=arm64/patching/rework
> >    Joey has started looking into this too.
> 
> Thanks for your information,  I don't know about this. As your say, we could
> add a depend on EXPERT
> 
> for now and more explanation into changlog.

So what I gather arm64's final select line may look like:

	select HAVE_ARCH_KCSAN if EXPERT && (CC_IS_GCC || CLANG_VERSION >= 120000)

> > 
> > * When I last tested, for simple boots I would get frequent KCSAN splats for a
> >    few common issues, and those drowned out all other reports.
> > 
> >    One case was manipulation of thread_info::flags, which Thomas Gleixner has
> >    queued some fixes at:
> > 
> >    https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/log/?h=core/entry
> >    There were some other common failures, e.g. accesses to task_struct::on_cpu,
> >    and I hadn't had the chance to investigate/fix those, beyond a (likely
> >    unsound) hack:
> > 
> >    https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/commit/?h=arm64/kcsan&id=4fe9d6c2ef85257d80291086e4514eaaebd3504e
> > 
> >    It would be good if we could identify the most frequent problems (e.g. those
> >    that will occur when just booting) before we enable this generally, to avoid
> >    everyone racing to report/fix those as soon as we enable the feature.
> > 
> >    When you tested, did KCSAN flag anything beyond the selftests?
> 
> Yes, there are some KCSAN reports, but this is not only exist on arm64, I
> saw  owner->on_cpu warning
> 
> on x86 too, eg, we also hack to disable it via data_race.
> 
> Reported by Kernel Concurrency Sanitizer on:
> CPU: 7 PID: 2530 Comm: syz-executor.11 Not tainted 5.10.0+ #113
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1.104/01/2014
> ==================================================================
> BUG: KCSAN: data-race in rwsem_spin_on_owner+0xf4/0x180
> 
> race at unknown origin, with read to 0xffff9767d3becfac of 4 bytes by task 18119 on cpu 0:
>  rwsem_spin_on_owner+0xf4/0x180
>  rwsem_optimistic_spin+0x48/0x480
>  rwsem_down_read_slowpath+0x4a0/0x670
>  down_read+0x69/0x190
>  process_vm_rw+0x41e/0x840
>  __x64_sys_process_vm_writev+0x76/0x90
>  do_syscall_64+0x37/0x50
>  entry_SYSCALL_64_after_hwframe+0x44/0xa9

I think fixing data races is not a pre-requisite for arch-enablement.
Some are slowly being addressed (and others aren't -- syzbot has a list
of >200 data races that I try to moderate and fix some or forward those
that I think will get fixed). I expect the most frequent issues will be
the same on arm64 as they are on x86.

I actually have a "fix" for the data race in rwsem_spin_on_owner, that
also shows where the other racing access comes from... which reminds me:
https://lkml.kernel.org/r/20211202101238.33546-1-elver@google.com

Thanks,
-- Marco
Marco Elver Dec. 2, 2021, 10:19 a.m. UTC | #7
On Thu, 2 Dec 2021 at 11:16, Marco Elver <elver@google.com> wrote:
[...]
> > Reported by Kernel Concurrency Sanitizer on:
> > CPU: 7 PID: 2530 Comm: syz-executor.11 Not tainted 5.10.0+ #113
> > Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1.104/01/2014
> > ==================================================================
> > BUG: KCSAN: data-race in rwsem_spin_on_owner+0xf4/0x180
> >
> > race at unknown origin, with read to 0xffff9767d3becfac of 4 bytes by task 18119 on cpu 0:
> >  rwsem_spin_on_owner+0xf4/0x180
> >  rwsem_optimistic_spin+0x48/0x480
> >  rwsem_down_read_slowpath+0x4a0/0x670
> >  down_read+0x69/0x190
> >  process_vm_rw+0x41e/0x840
> >  __x64_sys_process_vm_writev+0x76/0x90
> >  do_syscall_64+0x37/0x50
> >  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[...]
> I actually have a "fix" for the data race in rwsem_spin_on_owner, that
> also shows where the other racing access comes from... which reminds me:
> https://lkml.kernel.org/r/20211202101238.33546-1-elver@google.com

Reading be hard ... that patch is for mutex_spin_on_owner. But the
other racing access is the same.

Thanks,
-- Marco
Kefeng Wang Dec. 2, 2021, 10:45 a.m. UTC | #8
On 2021/12/2 18:15, Marco Elver wrote:
> On Thu, Dec 02, 2021 at 09:35AM +0800, Kefeng Wang wrote:
>> On 2021/12/1 19:53, Mark Rutland wrote:
>>> Hi Kefeng,
>>>
>>> On Mon, Nov 29, 2021 at 10:57:32PM +0800, Kefeng Wang wrote:
>>>> This patch enables KCSAN for arm64, with updates to build rules
>>>> to not use KCSAN for several incompatible compilation units.
>>>>
>>>> Resent GCC version(at least GCC10) made outline-atomics as the
>>>> default option(unlike Clang), which will cause linker errors
>>>> for kernel/kcsan/core.o.
>>>>
>>>> Disables the out-of-line atomics by no-outline-atomics to fix
>>>> the linker errors.
>>>>
>>>> Tested selftest and kcsan_test(built with GCC11 and Clang 13),
>>>> and all passed.
>>> Nice!
>>>
>>> I think there are a few additional bits and pieces we'll need:
>>>
>>> * Prior to clang 12.0.0, KCSAN would produce warnings with BTI, as I found in:
>>>
>>>     https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/commit/?h=arm64/kcsan&id=2d67c39ae4f619ca94d9790e09186e77922fa826
>>>
>>>     Since BTI is in defconfig, I think arm64's Kconfig should require a minimum
>>>     of clang 12.0.0 to enable KCSAN.
>> I don't have different clang version to test,  when check KCSAN,
>>
>> commit eb32f9f990d9 ("kcsan: Improve some Kconfig comments") saids,
>>
>>
>>      The compiler instruments plain compound read-write operations
>>      differently (++, --, +=, -=, |=, &=, etc.), which allows KCSAN to
>>      distinguish them from other plain accesses. This is currently
>>      supported by Clang 12 or later.
>>
>> Should we add a  "depends on CLANG_VERSION >= 120000"
> KCSAN works just fine with Clang 11. Clang 12 merely improves some
> instrumentation, which is what this comment is about.
>
> What Mark meant is that there's a specific issue with arm64 and BTI that
> is fixed by Clang 12. Therefore, arm64's Kconfig will have to do
>
> 	select HAVE_ARCH_KCSAN if CC_IS_GCC || CLANG_VERSION >= 120000
>
>>> * In the past clang did not have an attribute to suppress tsan instrumenation
>>>     and would instrument noinstr regions. I'm not sure when clang gained the
>>>     relevant attribute to supress this, but we will need to depend on this
>>>     existing, either based on the clang version or with a test for the attribute.
>>>
>>>     (If we're lucky, clang 12.0.0 is sufficient, and we solve BTI and this in one
>>>     go).
>>>
>>>     I *think* GCC always had an attribute, but I'm not certain.
>>>
>>>     Marco, is there an existing dependency somewhere for this to work on x86? I
>>>     thought there was an objtool pass to NOP this out, but I couldn't find it in
>>>     mainline. If x86 is implicitly depending on a sufficiently recent version of
>>>     clang, we add something to the common KCSAN Kconfig for ARCH_WANTS_NO_INSTR?
>>>
>>> * There are some latent issues with some code (e.g. alternatives, patching, insn)
>>>     code being instrumentable even though this is unsound, and depending on
>>>     compiler choices this can happen to be fine or can result in boot-time
>>>     failures (I saw lockups when I started trying to add KCSAN for arm64).
>>>
>>>     While this isn't just a KCSAN problem, fixing that requires some fairly
>>>     significant rework to a bunch of code, and until that's done we're on very
>>>     shaky ground. So I'd like to make KCSAN depend on EXPERT for now.
>>>
>>>     I had an initial stab at fixing some of that, e.g.
>>>
>>>     https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/log/?h=arm64/patching/rework
>>>     Joey has started looking into this too.
>> Thanks for your information,  I don't know about this. As your say, we could
>> add a depend on EXPERT
>>
>> for now and more explanation into changlog.
> So what I gather arm64's final select line may look like:
>
> 	select HAVE_ARCH_KCSAN if EXPERT && (CC_IS_GCC || CLANG_VERSION >= 120000)
Yes,  that's what we want now.
>
>>> * When I last tested, for simple boots I would get frequent KCSAN splats for a
>>>     few common issues, and those drowned out all other reports.
>>>
>>>     One case was manipulation of thread_info::flags, which Thomas Gleixner has
>>>     queued some fixes at:
>>>
>>>     https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/log/?h=core/entry
>>>     There were some other common failures, e.g. accesses to task_struct::on_cpu,
>>>     and I hadn't had the chance to investigate/fix those, beyond a (likely
>>>     unsound) hack:
>>>
>>>     https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/commit/?h=arm64/kcsan&id=4fe9d6c2ef85257d80291086e4514eaaebd3504e
>>>
>>>     It would be good if we could identify the most frequent problems (e.g. those
>>>     that will occur when just booting) before we enable this generally, to avoid
>>>     everyone racing to report/fix those as soon as we enable the feature.
>>>
>>>     When you tested, did KCSAN flag anything beyond the selftests?
>> Yes, there are some KCSAN reports, but this is not only exist on arm64, I
>> saw  owner->on_cpu warning
>>
>> on x86 too, eg, we also hack to disable it via data_race.
>>
>> Reported by Kernel Concurrency Sanitizer on:
>> CPU: 7 PID: 2530 Comm: syz-executor.11 Not tainted 5.10.0+ #113
>> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1.104/01/2014
>> ==================================================================
>> BUG: KCSAN: data-race in rwsem_spin_on_owner+0xf4/0x180
>>
>> race at unknown origin, with read to 0xffff9767d3becfac of 4 bytes by task 18119 on cpu 0:
>>   rwsem_spin_on_owner+0xf4/0x180
>>   rwsem_optimistic_spin+0x48/0x480
>>   rwsem_down_read_slowpath+0x4a0/0x670
>>   down_read+0x69/0x190
>>   process_vm_rw+0x41e/0x840
>>   __x64_sys_process_vm_writev+0x76/0x90
>>   do_syscall_64+0x37/0x50
>>   entry_SYSCALL_64_after_hwframe+0x44/0xa9
> I think fixing data races is not a pre-requisite for arch-enablement.
> Some are slowly being addressed (and others aren't -- syzbot has a list
> of >200 data races that I try to moderate and fix some or forward those
> that I think will get fixed). I expect the most frequent issues will be
> the same on arm64 as they are on x86.
>
> I actually have a "fix" for the data race in rwsem_spin_on_owner, that
> also shows where the other racing access comes from... which reminds me:
> https://lkml.kernel.org/r/20211202101238.33546-1-elver@google.com

There's a owner_on_cpu(),  we could reuse it,

diff --git a/include/linux/sched.h b/include/linux/sched.h
index aae991f511c3..f2e99e8f75bd 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2171,6 +2171,15 @@ static inline bool vcpu_is_preempted(int cpu)
  }
  #endif

+static inline bool owner_on_cpu(struct task_struct *owner)
+{
+	/*
+	 * As lock holder preemption issue, we both skip spinning if
+	 * task is not on cpu or its cpu is preempted
+	 */
+	return READ_ONCE(owner->on_cpu) && !vcpu_is_preempted(task_cpu(owner));
+}
+
  extern long sched_setaffinity(pid_t pid, const struct cpumask *new_mask);
  extern long sched_getaffinity(pid_t pid, struct cpumask *mask);
  
diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index 2fede72b6af5..29e0ac58259d 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -361,11 +361,7 @@ bool mutex_spin_on_owner(struct mutex *lock, struct task_struct *owner,
  		 */
  		barrier();
  
-		/*
-		 * Use vcpu_is_preempted to detect lock holder preemption issue.
-		 */
-		if (!owner->on_cpu || need_resched() ||
-				vcpu_is_preempted(task_cpu(owner))) {
+		if (!owner_on_cpu(owner) || need_resched()) {
  			ret = false;
  			break;
  		}
@@ -396,12 +392,8 @@ static inline int mutex_can_spin_on_owner(struct mutex *lock)
  	rcu_read_lock();
  	owner = __mutex_owner(lock);
  
-	/*
-	 * As lock holder preemption issue, we both skip spinning if task is not
-	 * on cpu or its cpu is preempted
-	 */
  	if (owner)
-		retval = owner->on_cpu && !vcpu_is_preempted(task_cpu(owner));
+		retval = owner_on_cpu(owner);
  	rcu_read_unlock();
  
  	/*
diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index 000e8d5a2884..30d95a6717d2 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -596,15 +596,6 @@ static inline bool rwsem_try_write_lock_unqueued(struct rw_semaphore *sem)
  	return false;
  }
  
-static inline bool owner_on_cpu(struct task_struct *owner)
-{
-	/*
-	 * As lock holder preemption issue, we both skip spinning if
-	 * task is not on cpu or its cpu is preempted
-	 */
-	return owner->on_cpu && !vcpu_is_preempted(task_cpu(owner));
-}
-


> Thanks,
> -- Marco
> .
Marco Elver Dec. 2, 2021, 10:58 a.m. UTC | #9
On Thu, 2 Dec 2021 at 11:45, Kefeng Wang <wangkefeng.wang@huawei.com> wrote:
[...]
> >> for now and more explanation into changlog.
> > So what I gather arm64's final select line may look like:
> >
> >       select HAVE_ARCH_KCSAN if EXPERT && (CC_IS_GCC || CLANG_VERSION >= 120000)
> Yes,  that's what we want now.

I think we have all the pieces sorted out, so hopefully you have
everything you need to do v3.

[...]
> > I actually have a "fix" for the data race in rwsem_spin_on_owner, that
> > also shows where the other racing access comes from... which reminds me:
> > https://lkml.kernel.org/r/20211202101238.33546-1-elver@google.com
>
> There's a owner_on_cpu(),  we could reuse it,

That looks like a reasonable thing to do, but can't say if it is
actually the right thing to do. You could suggest it in reply to the
patch I just sent and see what people say.

Thanks,
-- Marco
Kefeng Wang Dec. 2, 2021, 11:44 a.m. UTC | #10
On 2021/12/2 18:58, Marco Elver wrote:
> On Thu, 2 Dec 2021 at 11:45, Kefeng Wang <wangkefeng.wang@huawei.com> wrote:
> [...]
>>>> for now and more explanation into changlog.
>>> So what I gather arm64's final select line may look like:
>>>
>>>        select HAVE_ARCH_KCSAN if EXPERT && (CC_IS_GCC || CLANG_VERSION >= 120000)
>> Yes,  that's what we want now.
> I think we have all the pieces sorted out, so hopefully you have
> everything you need to do v3.
OK, I will resend v3.
>
> [...]
>>> I actually have a "fix" for the data race in rwsem_spin_on_owner, that
>>> also shows where the other racing access comes from... which reminds me:
>>> https://lkml.kernel.org/r/20211202101238.33546-1-elver@google.com
>> There's a owner_on_cpu(),  we could reuse it,
> That looks like a reasonable thing to do, but can't say if it is
> actually the right thing to do. You could suggest it in reply to the
> patch I just sent and see what people say.
Could you cc me, I can't reply email due to no subscribe to mailing 
list, thanks.
diff mbox series

Patch

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 4ff73299f8a9..0ac90875f71d 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -150,6 +150,7 @@  config ARM64
 	select HAVE_ARCH_KASAN_VMALLOC if HAVE_ARCH_KASAN
 	select HAVE_ARCH_KASAN_SW_TAGS if HAVE_ARCH_KASAN
 	select HAVE_ARCH_KASAN_HW_TAGS if (HAVE_ARCH_KASAN && ARM64_MTE)
+	select HAVE_ARCH_KCSAN
 	select HAVE_ARCH_KFENCE
 	select HAVE_ARCH_KGDB
 	select HAVE_ARCH_MMAP_RND_BITS
diff --git a/arch/arm64/kernel/vdso/Makefile b/arch/arm64/kernel/vdso/Makefile
index 700767dfd221..60813497a381 100644
--- a/arch/arm64/kernel/vdso/Makefile
+++ b/arch/arm64/kernel/vdso/Makefile
@@ -32,6 +32,7 @@  ccflags-y += -DDISABLE_BRANCH_PROFILING -DBUILD_VDSO
 CFLAGS_REMOVE_vgettimeofday.o = $(CC_FLAGS_FTRACE) -Os $(CC_FLAGS_SCS) $(GCC_PLUGINS_CFLAGS) \
 				$(CC_FLAGS_LTO)
 KASAN_SANITIZE			:= n
+KCSAN_SANITIZE			:= n
 UBSAN_SANITIZE			:= n
 OBJECT_FILES_NON_STANDARD	:= y
 KCOV_INSTRUMENT			:= n
diff --git a/arch/arm64/kvm/hyp/nvhe/Makefile b/arch/arm64/kvm/hyp/nvhe/Makefile
index c3c11974fa3b..24b2c2425b38 100644
--- a/arch/arm64/kvm/hyp/nvhe/Makefile
+++ b/arch/arm64/kvm/hyp/nvhe/Makefile
@@ -89,6 +89,7 @@  KBUILD_CFLAGS := $(filter-out $(CC_FLAGS_FTRACE) $(CC_FLAGS_SCS) $(CC_FLAGS_CFI)
 # cause crashes. Just disable it.
 GCOV_PROFILE	:= n
 KASAN_SANITIZE	:= n
+KCSAN_SANITIZE	:= n
 UBSAN_SANITIZE	:= n
 KCOV_INSTRUMENT	:= n
 
diff --git a/kernel/kcsan/Makefile b/kernel/kcsan/Makefile
index c2bb07f5bcc7..e893b0e1d62a 100644
--- a/kernel/kcsan/Makefile
+++ b/kernel/kcsan/Makefile
@@ -8,6 +8,7 @@  CFLAGS_REMOVE_debugfs.o = $(CC_FLAGS_FTRACE)
 CFLAGS_REMOVE_report.o = $(CC_FLAGS_FTRACE)
 
 CFLAGS_core.o := $(call cc-option,-fno-conserve-stack) \
+	$(call cc-option,-mno-outline-atomics) \
 	-fno-stack-protector -DDISABLE_BRANCH_PROFILING
 
 obj-y := core.o debugfs.o report.o