[00/15] kasan: x86: arm64: risc-v: KASAN tag-based mode for x86

Message ID	cover.1738686764.git.maciej.wieczor-retman@intel.com (mailing list archive)
Headers	show Return-Path: <linux-riscv-bounces+linux-riscv=archiver.kernel.org@lists.infradead.org> From: Maciej Wieczor-Retman <maciej.wieczor-retman@intel.com> To: luto@kernel.org, xin@zytor.com, kirill.shutemov@linux.intel.com, palmer@dabbelt.com, tj@kernel.org, andreyknvl@gmail.com, brgerst@gmail.com, ardb@kernel.org, dave.hansen@linux.intel.com, jgross@suse.com, will@kernel.org, akpm@linux-foundation.org, arnd@arndb.de, corbet@lwn.net, maciej.wieczor-retman@intel.com, dvyukov@google.com, richard.weiyang@gmail.com, ytcoode@gmail.com, tglx@linutronix.de, hpa@zytor.com, seanjc@google.com, paul.walmsley@sifive.com, aou@eecs.berkeley.edu, justinstitt@google.com, jason.andryuk@amd.com, glider@google.com, ubizjak@gmail.com, jannh@google.com, bhe@redhat.com, vincenzo.frascino@arm.com, rafael.j.wysocki@intel.com, ndesaulniers@google.com, mingo@redhat.com, catalin.marinas@arm.com, junichi.nomura@nec.com, nathan@kernel.org, ryabinin.a.a@gmail.com, dennis@kernel.org, bp@alien8.de, kevinloughlin@google.com, morbo@google.com, dan.j.williams@intel.com, julian.stecklina@cyberus-technology.de, peterz@infradead.org, cl@linux.com, kees@kernel.org Cc: kasan-dev@googlegroups.com, x86@kernel.org, linux-arm-kernel@lists.infradead.org, linux-riscv@lists.infradead.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, llvm@lists.linux.dev, linux-doc@vger.kernel.org Subject: [PATCH 00/15] kasan: x86: arm64: risc-v: KASAN tag-based mode for x86 Date: Tue, 4 Feb 2025 18:33:41 +0100 Message-ID: <cover.1738686764.git.maciej.wieczor-retman@intel.com> MIME-Version: 1.0 Precedence: list Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Sender: "linux-riscv" <linux-riscv-bounces@lists.infradead.org> Errors-To: linux-riscv-bounces+linux-riscv=archiver.kernel.org@lists.infradead.org
Series	kasan: x86: arm64: risc-v: KASAN tag-based mode for x86 \| expand [00/15] kasan: x86: arm64: risc-v: KASAN tag-based mode for x86 [01/15] kasan: Allocation enhancement for dense tag-based mode [02/15] kasan: Tag checking with dense tag-based mode [03/15] kasan: Vmalloc dense tag-based mode support [04/15] kasan: arm64: x86: risc-v: Make special tags arch specific [05/15] x86: Add arch specific kasan functions [06/15] x86: Reset tag for virtual to physical address conversions [07/15] mm: Pcpu chunk address tag reset [08/15] x86: Physical address comparisons in fill_p*d/pte [09/15] x86: Physical address comparison in current_mm pgd check [10/15] x86: KASAN raw shadow memory PTE init [11/15] x86: LAM initialization [12/15] x86: Minimal SLAB alignment [13/15] x86: runtime_const used for KASAN_SHADOW_END [14/15] x86: Make software tag-based kasan available [15/15] kasan: Add mititgation and debug modes

Maciej Wieczor-Retman Feb. 4, 2025, 5:33 p.m. UTC

======= Introduction
The patchset aims to add a KASAN tag-based mode for the x86 architecture
with the help of the new CPU feature called Linear Address Masking
(LAM). Main improvement introduced by the series is 4x lower memory
usage compared to KASAN's generic mode, the only currently available
mode on x86.

There are two logical parts to this series. The first one attempts to
add a new memory saving mechanism called "dense mode" to the generic
part of the tag-based KASAN code. The second one focuses on implementing
and enabling the tag-based mode for the x86 architecture by using LAM.

======= How KASAN tag-based mode works?
When enabled, memory accesses and allocations are augmented by the
compiler during kernel compilation. Instrumentation functions are added
to each memory allocation and each pointer dereference.

The allocation related functions generate a random tag and save it in
two places: in shadow memory that maps to the allocated memory, and in
the top bits of the pointer that points to the allocated memory. Storing
the tag in the top of the pointer is possible because of Top-Byte Ignore
(TBI) on arm64 architecture and LAM on x86.

The access related functions are performing a comparison between the tag
stored in the pointer and the one stored in shadow memory. If the tags
don't match an out of bounds error must have occurred and so an error
report is generated.

The general idea for the tag-based mode is very well explained in the
series with the original implementation [1].

[1] https://lore.kernel.org/all/cover.1544099024.git.andreyknvl@google.com/

======= What is the new "dense mode"?
To further save memory the dense mode is introduced. The idea is that
normally one shadow byte stores one tag and this one tag covers one
granule of allocated memory which is 16 bytes. In the dense mode, one
tag still covers 16 bytes of allocated memory but is shortened in length
from 8 bits to 4 bits which makes it possible to store two tags in one
shadow memory byte.

=== Example:
The example below shows how the shadow memory looks like after
allocating 48 bytes of memory in both normal tag-based mode and the
dense mode. The contents of shadow memory are overlaid onto address
offsets that they relate to in the allocated kernel memory. Each cell
|        | symbolizes one byte of shadow memory.

= The regular tag based mode:
- Randomly generated 8-bit tag equals 0xAB.
- 0xFE is the tag that symbolizes unallocated memory.

Shadow memory contents:           |  0xAB  |  0xAB  |  0xAB  |  0xFE  |
Shadow memory address offsets:    0        1        2        3        4
Allocated memory address offsets: 0        16       32       48       64

= The dense tag based mode:
- Randomly generated 4-bit tag equals 0xC.
- 0xE is the tag that symbolizes unallocated memory.

Shadow memory contents:           |0xC 0xC |0xC 0xE |0xE 0xE |0xE 0xE |
Shadow memory address offsets:    0        1        2        3        4
Allocated memory address offsets: 0        32       64       96       128

=== Dense mode benefits summary
For a small price of a couple of bit shifts, the dense mode uses only
half the memory compared to the current arm64 tag-based mode, while
still preserving the 16 byte tag granularity which allows catching
smaller offsets of out of bounds errors.

======= Differences summary compared to the arm64 tag-based mode
- Tag width:
	- Tag width influences the chance of a tag mismatch due to two
	  tags from different allocations having the same value. The
	  bigger the possible range of tag values the lower the chance
	  of that happening.
	- Shortening the tag width from 8 bits to 4, while helping with
	  memory usage also increases the chance of not reporting an
	  error. 4 bit tags have a ~7% chance of a tag mismatch.

- TBI and LAM
	- TBI in arm64 allows for storing metadata in the top 8 bits of
	  the virtual address.
	- LAM in x86 allows storing tags in bits [62:57] of the pointer.
	  To maximize memory savings the tag width is reduced to bits
	  [60:57].

======= Testing
Checked all the kunits for both software tags and generic KASAN after
making changes.

In generic mode the results were:

kasan: pass:59 fail:0 skip:13 total:72
Totals: pass:59 fail:0 skip:13 total:72
ok 1 kasan

and for software tags:

kasan: pass:63 fail:0 skip:9 total:72
Totals: pass:63 fail:0 skip:9 total:72
ok 1 kasan

======= Benchmarks
All tests were ran on a Sierra Forest server platform with 512GB of
memory. The only differences between the tests were kernel options:
	- CONFIG_KASAN
	- CONFIG_KASAN_GENERIC
	- CONFIG_KASAN_SW_TAGS
	- CONFIG_KASAN_INLINE [1]
	- CONFIG_KASAN_OUTLINE [1]

Used memory in GBs after boot [2][3]:
* 14 for clean kernel
* 91 / 90 for generic KASAN (inline/outline)
* 31 for tag-based KASAN

Boot time (until login prompt):
* 03:48 for clean kernel
* 08:02 / 09:45 for generic KASAN (inline/outline)
* 08:50 for dense tag-based KASAN
* 04:50 for dense tag-based KASAN with stacktrace disabled [4]

Compilation time comparison (10 cores):
* 7:27 for clean kernel
* 8:21/7:44 for generic KASAN (inline/outline)
* 7:41 for tag-based KASAN

Network performance [5]:
* 13.7 Gbits/sec for clean kernel
* 2.25 Gbits/sec for generic KASAN inline
* 1.50 Gbits/sec for generic KASAN outline
* 1.55 Gbits/sec for dense tag-based KASAN
* 2.86 Gbits/sec for dense tag-based KASAN with stacktrace disabled

[1] Based on hwasan and asan compiler parameters used in
scripts/Makefile.kasan it looks like inline/outline modes have a bigger
impact on generic mode than the tag-based mode. In the former inlining
actually increases the kernel image size and improves performance. In
the latter it un-inlines some code portions for debugging purposes when
the outline mode is chosen but no real difference is visible in
performance and kernel image size.

[2] Used "cat /proc/meminfo | grep MemAvailable" and then subtracted
that from the total memory of the system. Initially wanted to use "grep
Slab" similarly to the cover letter for arm64 tag-based series but
because the tests were ran on a system with 512GB of RAM and memory
usage was more split up between different categories this better shows
the memory savings.

[3] If the 14 GBs from the clean build were subtracted from the KASAN
measurements one can see that the tag-based mode uses about 4x less of
the additional memory compared to the generic mode.

[4] Memory allocation and freeing performance suffers heavily from saving
stacktraces that can be later displayed in error reports.

[5] Measured as `iperf -s & iperf -c 127.0.0.1 -t 30`.

======= Compilation
Clang was used to compile the series (make LLVM=1) since gcc doesn't
seem to have support for KASAN tag-based compiler instrumentation on
x86.

======= Dependencies
Series is based on risc-v series [1] that's currently in review. Because
of this for the time being it only applies cleanly on top of 6.12
mainline kernel. Will rebase on the newest kernel once the risc-v series
is also rebased.

[1] https://lore.kernel.org/all/20241022015913.3524425-1-samuel.holland@sifive.com/

Maciej Wieczor-Retman (15):
  kasan: Allocation enhancement for dense tag-based mode
  kasan: Tag checking with dense tag-based mode
  kasan: Vmalloc dense tag-based mode support
  kasan: arm64: x86: risc-v: Make special tags arch specific
  x86: Add arch specific kasan functions
  x86: Reset tag for virtual to physical address conversions
  mm: Pcpu chunk address tag reset
  x86: Physical address comparisons in fill_p*d/pte
  x86: Physical address comparison in current_mm pgd check
  x86: KASAN raw shadow memory PTE init
  x86: LAM initialization
  x86: Minimal SLAB alignment
  x86: runtime_const used for KASAN_SHADOW_END
  x86: Make software tag-based kasan available
  kasan: Add mititgation and debug modes

 Documentation/arch/x86/x86_64/mm.rst |  6 +-
 MAINTAINERS                          |  2 +-
 arch/arm64/include/asm/kasan-tags.h  |  9 +++
 arch/riscv/include/asm/kasan-tags.h  | 12 ++++
 arch/riscv/include/asm/kasan.h       |  4 --
 arch/x86/Kconfig                     | 11 +++-
 arch/x86/boot/compressed/misc.h      |  2 +
 arch/x86/include/asm/kasan-tags.h    |  9 +++
 arch/x86/include/asm/kasan.h         | 50 +++++++++++++--
 arch/x86/include/asm/page.h          | 17 +++--
 arch/x86/include/asm/page_64.h       |  2 +-
 arch/x86/kernel/head_64.S            |  3 +
 arch/x86/kernel/setup.c              |  2 +
 arch/x86/kernel/vmlinux.lds.S        |  1 +
 arch/x86/mm/init.c                   |  3 +
 arch/x86/mm/init_64.c                |  8 +--
 arch/x86/mm/kasan_init_64.c          | 24 +++++--
 arch/x86/mm/physaddr.c               |  1 +
 arch/x86/mm/tlb.c                    |  2 +-
 include/linux/kasan-tags.h           | 12 +++-
 include/linux/kasan.h                | 94 +++++++++++++++++++++++-----
 include/linux/mm.h                   |  6 +-
 include/linux/page-flags-layout.h    |  7 +--
 lib/Kconfig.kasan                    | 49 +++++++++++++++
 mm/kasan/Makefile                    |  3 +
 mm/kasan/dense.c                     | 83 ++++++++++++++++++++++++
 mm/kasan/kasan.h                     | 27 +-------
 mm/kasan/report.c                    |  6 +-
 mm/kasan/report_sw_tags.c            | 12 ++--
 mm/kasan/shadow.c                    | 47 ++++++++++----
 mm/kasan/sw_tags.c                   |  8 +++
 mm/kasan/tags.c                      |  5 ++
 mm/percpu-vm.c                       |  2 +-
 33 files changed, 432 insertions(+), 97 deletions(-)
 create mode 100644 arch/arm64/include/asm/kasan-tags.h
 create mode 100644 arch/riscv/include/asm/kasan-tags.h
 create mode 100644 arch/x86/include/asm/kasan-tags.h
 create mode 100644 mm/kasan/dense.c

Christoph Lameter (Ampere) Feb. 4, 2025, 6:58 p.m. UTC | #1

ARM64 supports MTE which is hardware support for tagging 16 byte granules
and verification of tags in pointers all in hardware and on some platforms
with *no* performance penalty since the tag is stored in the ECC areas of
DRAM and verified at the same time as the ECC.

Could we get support for that? This would allow us to enable tag checking
in production systems without performance penalty and no memory overhead.

Dave Hansen Feb. 4, 2025, 9:05 p.m. UTC | #2

On 2/4/25 10:58, Christoph Lameter (Ampere) wrote:
> ARM64 supports MTE which is hardware support for tagging 16 byte granules
> and verification of tags in pointers all in hardware and on some platforms
> with *no* performance penalty since the tag is stored in the ECC areas of
> DRAM and verified at the same time as the ECC.
> 
> Could we get support for that? This would allow us to enable tag checking
> in production systems without performance penalty and no memory overhead.

At least on the Intel side, there's no trajectory for doing something
like the MTE architecture for memory tagging. The DRAM "ECC" area is in
very high demand and if anything things are moving away from using ECC
"bits" for anything other than actual ECC. Even the MKTME+integrity
(used for TDX) metadata is probably going to find a new home at some point.

This shouldn't be a surprise to anyone on cc here. If it is, you should
probably be reaching out to Intel over your normal channels.

Jessica Clarke Feb. 4, 2025, 11:36 p.m. UTC | #3

On 4 Feb 2025, at 18:58, Christoph Lameter (Ampere) <cl@gentwo.org> wrote:
> ARM64 supports MTE which is hardware support for tagging 16 byte granules
> and verification of tags in pointers all in hardware and on some platforms
> with *no* performance penalty since the tag is stored in the ECC areas of
> DRAM and verified at the same time as the ECC.
> 
> Could we get support for that? This would allow us to enable tag checking
> in production systems without performance penalty and no memory overhead.

It’s not “no performance penalty”, there is a cost to tracking the MTE
tags for checking. In asynchronous (or asymmetric) mode that’s not too
bad, but in synchronous mode there is a significant overhead even with
ECC. Normally on a store, once you’ve translated it and have the data,
you can buffer it up and defer the actual write until some time later.
If you hit in the L1 cache then that will probably be quite soon, but
if you miss then you have to wait for the data to come back from lower
levels of the hierarchy, potentially all the way out to DRAM. Or if you
have a write-around cache then you just send it out to the next level
when it’s ready. But now, if you have synchronous MTE, you cannot
retire your store instruction until you know what the tag for the
location you’re storing to is; effectively you have to wait until you
can do the full cache lookup, and potentially miss, until it can
retire. This puts pressure on the various microarchitectural structures
that track instructions as they get executed, as instructions are now
in flight for longer. Yes, it may well be that it is quicker for the
memory controller to get the tags from ECC bits than via some other
means, but you’re already paying many many cycles at that point, with
the relevant store being stuck unable to retire (and thus every
instruction after it in the instruction stream) that whole time, and no
write allocate or write around schemes can help you, because you
fundamentally have to wait for the tags to be read before you know if
the instruction is going to trap.

Now, you can choose to not use synchronous mode due to that overhead,
but that’s nuance that isn’t considered by your reply here and has some
consequences.

Jess

Jessica Clarke Feb. 4, 2025, 11:36 p.m. UTC | #4

On 4 Feb 2025, at 18:58, Christoph Lameter (Ampere) <cl@gentwo.org> wrote:
> ARM64 supports MTE which is hardware support for tagging 16 byte granules
> and verification of tags in pointers all in hardware and on some platforms
> with *no* performance penalty since the tag is stored in the ECC areas of
> DRAM and verified at the same time as the ECC.
> 
> Could we get support for that? This would allow us to enable tag checking
> in production systems without performance penalty and no memory overhead.

It’s not “no performance penalty”, there is a cost to tracking the MTE
tags for checking. In asynchronous (or asymmetric) mode that’s not too
bad, but in synchronous mode there is a significant overhead even with
ECC. Normally on a store, once you’ve translated it and have the data,
you can buffer it up and defer the actual write until some time later.
If you hit in the L1 cache then that will probably be quite soon, but
if you miss then you have to wait for the data to come back from lower
levels of the hierarchy, potentially all the way out to DRAM. Or if you
have a write-around cache then you just send it out to the next level
when it’s ready. But now, if you have synchronous MTE, you cannot
retire your store instruction until you know what the tag for the
location you’re storing to is; effectively you have to wait until you
can do the full cache lookup, and potentially miss, until it can
retire. This puts pressure on the various microarchitectural structures
that track instructions as they get executed, as instructions are now
in flight for longer. Yes, it may well be that it is quicker for the
memory controller to get the tags from ECC bits than via some other
means, but you’re already paying many many cycles at that point, with
the relevant store being stuck unable to retire (and thus every
instruction after it in the instruction stream) that whole time, and no
write allocate or write around schemes can help you, because you
fundamentally have to wait for the tags to be read before you know if
the instruction is going to trap.

Now, you can choose to not use synchronous mode due to that overhead,
but that’s nuance that isn’t considered by your reply here and has some
consequences.

Jess

Christoph Lameter (Ampere) Feb. 5, 2025, 6:51 p.m. UTC | #5

On Tue, 4 Feb 2025, Jessica Clarke wrote:

> It’s not “no performance penalty”, there is a cost to tracking the MTE
> tags for checking. In asynchronous (or asymmetric) mode that’s not too

On Ampere Processor hardware there is no penalty since the logic is build
into the usual read/write paths. This is by design. There may be on other
platforms that cannot do this.

Christoph Lameter (Ampere) Feb. 5, 2025, 6:59 p.m. UTC | #6

On Tue, 4 Feb 2025, Dave Hansen wrote:

> > Could we get support for that? This would allow us to enable tag checking
> > in production systems without performance penalty and no memory overhead.
>
> At least on the Intel side, there's no trajectory for doing something
> like the MTE architecture for memory tagging. The DRAM "ECC" area is in
> very high demand and if anything things are moving away from using ECC
> "bits" for anything other than actual ECC. Even the MKTME+integrity
> (used for TDX) metadata is probably going to find a new home at some point.
>
> This shouldn't be a surprise to anyone on cc here. If it is, you should
> probably be reaching out to Intel over your normal channels.

Intel was a competitor for our company and AFAICT has issues all over
the place with performance given its conservative stands on technology. But
we do not test against Intel anymore. Can someone from AMD say something?

MTE tagging is part of the processor standard for ARM64 and Linux will
need to support the 16 byte tagging feature one way or another even if
Intel does not like it. And AFAICT hardware tagging support is a critical
security feature for the future.

[00/15] kasan: x86: arm64: risc-v: KASAN tag-based mode for x86

Message

Comments