[RFC,00/37] Add support for arm64 MTE dynamic tag storage reuse

Message ID 20230823131350.114942-1-alexandru.elisei@arm.com (mailing list archive)
State Handled Elsewhere

Commit Message

Alexandru Elisei Aug. 23, 2023, 1:13 p.m. UTC
Introduction
============

Arm has implemented memory coloring in hardware, and the feature is called
Memory Tagging Extensions (MTE). It works by embedding a 4 bit tag in bits
59..56 of a pointer, and storing this tag to a reserved memory location.
When the pointer is dereferenced, the hardware compares the tag embedded in
the pointer (logical tag) with the tag stored in memory (allocation tag).
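
A minimal user-space sketch of the tag layout described above, for
illustration only. The names LOGICAL_TAG_SHIFT, LOGICAL_TAG_MASK and the
three helpers are hypothetical and are not taken from the series; the real
comparison is done by the hardware, not by software.

```c
#include <assert.h>
#include <stdint.h>

/* The logical tag is 4 bits wide and lives in pointer bits 59..56. */
#define LOGICAL_TAG_SHIFT 56
#define LOGICAL_TAG_MASK  0xfULL

/* Return the 4-bit logical tag embedded in a 64-bit address. */
static inline uint8_t ptr_logical_tag(uint64_t addr)
{
	return (uint8_t)((addr >> LOGICAL_TAG_SHIFT) & LOGICAL_TAG_MASK);
}

/* Return the address with its logical tag replaced by `tag`. */
static inline uint64_t ptr_set_tag(uint64_t addr, uint8_t tag)
{
	assert(tag <= LOGICAL_TAG_MASK);
	addr &= ~(LOGICAL_TAG_MASK << LOGICAL_TAG_SHIFT);
	return addr | ((uint64_t)tag << LOGICAL_TAG_SHIFT);
}

/* Software model of the check made when a tagged pointer is dereferenced. */
static inline int tag_check_passes(uint64_t addr, uint8_t allocation_tag)
{
	return ptr_logical_tag(addr) == (allocation_tag & LOGICAL_TAG_MASK);
}
```

In this model, a pointer carrying logical tag 0x3 passes the check only if
the allocation tag stored for the memory it points to is also 0x3.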

The relation between memory and where the tag for that memory is stored is
static.

The memory where the tags are stored has so far been inaccessible to Linux.
This series aims to change that by adding support for using the tag storage
memory as data memory only; tag storage memory cannot itself be tagged.


Implementation
==============

The series is based on v6.5-rc3 with these two patches cherry picked:

- mm: Call arch_swap_restore() from unuse_pte():

    https://lore.kernel.org/all/20230523004312.1807357-3-pcc@google.com/

- arm64: mte: Simplify swap tag restoration logic:

    https://lore.kernel.org/all/20230523004312.1807357-4-pcc@google.com/

The above two patches are queued for the v6.6 merge window:

    https://lore.kernel.org/all/20230702123821.04e64ea2c04dd0fdc947bda3@linux-foundation.org/

The entire series, including the above patches, can be cloned with:

$ git clone https://gitlab.arm.com/linux-arm/linux-ae.git \
	-b arm-mte-dynamic-carveout-rfc-v1

On the arm64 architecture side, an extension is being worked on that will
clarify how MTE tag storage reuse should behave. The extension will be
made public soon.

On the Linux side, MTE tag storage reuse is accomplished with the
following changes:

1. The tag storage memory is exposed to the memory allocator as a new
migratetype, MIGRATE_METADATA. It behaves similarly to MIGRATE_CMA, with
the restriction that it cannot be used to allocate tagged memory (tag
storage memory cannot be tagged). On tagged page allocation, the
corresponding tag storage is reserved via alloc_contig_range().

2. mprotect(PROT_MTE) is implemented by changing the pte prot to
PAGE_METADATA_NONE. When the page is next accessed, a fault is taken and
the corresponding tag storage is reserved.

3. When the code tries to copy tags to a page which doesn't have the tag
storage reserved, the tags are copied to an xarray and restored in
set_pte_at(), when the page is eventually mapped with the tag storage
reserved.

KVM support has not been implemented yet, because a non-MTE enabled VMA
can back the memory of an MTE-enabled VM. Once there is consensus on the
right approach for the memory management support, I will add it.

Explanations for the last two changes follow. The gist of it is that they
were added mostly because of races, and it is my intention to make the code
more robust.

PAGE_METADATA_NONE was introduced to avoid races with mprotect(PROT_MTE).
For example, migration can race with mprotect(PROT_MTE):
- thread 0 initiates migration for a page in a non-MTE enabled VMA and a
  destination page is allocated without tag storage.
- thread 1 handles an mprotect(PROT_MTE), the VMA becomes tagged, and an
  access turns the source page that is in the process of being migrated
  into a tagged page.
- thread 0 finishes migration and the destination page is mapped as tagged,
  but without tag storage reserved.
More details and examples can be found in the patches.

This race is also related to how tag restoring is handled when tag storage
is missing: when a tagged page is swapped out, the tags are saved in an
xarray indexed by swp_entry.val. When a page is swapped back in, if there
are tags corresponding to the swp_entry that the page will replace, the
tags are unconditionally restored, even if the page will be mapped as
untagged. Because the page will be mapped as untagged, tag storage was
not reserved when the page was allocated to replace the swp_entry which has
tags associated with it.

To get around this, save the tags in a new xarray, this time indexed by
pfn, and restore them when the same page is mapped as tagged.

This also solves another race, this time with copy_highpage(). In the
scenario where migration races with mprotect(PROT_MTE), before the page is
mapped, the contents of the source page are copied to the destination. This
includes the tags, which would be copied to a page with missing tag
storage; that can lead to data corruption if the missing tag storage is in
use for data. So copy_highpage() has received a similar treatment to the
swap code, and the source tags are copied into the xarray indexed by the
destination page pfn.
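
The save-now, restore-on-map idea can be modelled in user space roughly as
follows. This is only a sketch under stated assumptions: a small fixed table
stands in for the xarray, the page size is assumed to be 4 KiB (so 128 bytes
of tags per page), and save_tags_for_pfn() / restore_tags_for_pfn() are
hypothetical names, not functions from the series.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define PAGE_TAG_BYTES 128 /* assumed 4 KiB page: 256 granules * 4 bits */
#define MAX_SAVED      16  /* toy capacity; stands in for the xarray */

struct saved_tags {
	uint64_t pfn;
	int used;
	uint8_t tags[PAGE_TAG_BYTES];
};

static struct saved_tags saved[MAX_SAVED];

/* Stash the tags for a page whose tag storage is not reserved yet. */
static int save_tags_for_pfn(uint64_t pfn, const uint8_t *tags)
{
	struct saved_tags *slot = NULL;

	assert(tags != NULL);
	for (int i = 0; i < MAX_SAVED; i++) {
		if (saved[i].used && saved[i].pfn == pfn) {
			slot = &saved[i]; /* overwrite an existing entry */
			break;
		}
		if (!saved[i].used && !slot)
			slot = &saved[i]; /* remember the first free slot */
	}
	if (!slot)
		return -1; /* table full */

	slot->pfn = pfn;
	slot->used = 1;
	memcpy(slot->tags, tags, PAGE_TAG_BYTES);
	return 0;
}

/*
 * Called when the page is finally mapped as tagged, i.e. once its tag
 * storage has been reserved. Copies the saved tags out and drops the
 * entry. Returns 1 if tags were restored, 0 if none were saved.
 */
static int restore_tags_for_pfn(uint64_t pfn, uint8_t *dst)
{
	for (int i = 0; i < MAX_SAVED; i++) {
		if (saved[i].used && saved[i].pfn == pfn) {
			memcpy(dst, saved[i].tags, PAGE_TAG_BYTES);
			saved[i].used = 0;
			return 1;
		}
	}
	return 0;
}
```

The point of keying by pfn rather than by swp_entry is that the lookup at
map time only needs to know which physical page is being mapped.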


Overview of the patches
=======================

Patches 1-3 do some preparatory work by renaming a few functions and a gfp
flag.

Patches 4-12 are arch independent and introduce MIGRATE_METADATA to the
page allocator.

Patches 13-18 are arm64 specific and add support for detecting the tag
storage region and onlining it with the MIGRATE_METADATA migratetype.

Patches 19-24 are arch independent and modify the page allocator to
call back into arch dependent functions to reserve metadata storage for an
allocation which requires metadata.

Patches 25-28 are mostly arm64 specific and implement the reservation and
freeing of tag storage on tagged page allocation. Patch #28 ("mm: sched:
Introduce PF_MEMALLOC_ISOLATE") adds a current flag, PF_MEMALLOC_ISOLATE,
which ignores page isolation limits; this is used by arm64 when reserving
tag storage in the same patch.

Patches 29-30 add arch independent support for doing mprotect(PROT_MTE)
when metadata storage is enabled.

Patches 31-37 are mostly arm64 specific and handle the restoring of tags
when tag storage is missing. The exceptions are patches 32 (adds the
arch_swap_prepare_to_restore() function) and 35 (adds PAGE_METADATA_NONE
support for THPs).

Testing
=======

To enable MTE dynamic tag storage:

- CONFIG_ARM64_MTE_TAG_STORAGE=y
- system_supports_mte() returns true
- kasan_hw_tags_enabled() returns false
- correct DTB node (for the specification, see commit "arm64: mte: Reserve tag
  storage memory")

Check dmesg for the message "MTE tag storage enabled" or grep for metadata
in /proc/vmstat.

I've tested the series using FVP with MTE enabled, but without support for
dynamic tag storage reuse. To simulate it, I've added two fake tag storage
regions in the DTB by splitting a 2GB region roughly into 33 slices of size
0x3e0_0000, and using 32 of them for tagged memory and one slice for tag
storage:



Alexandru Elisei (37):
  mm: page_alloc: Rename gfp_to_alloc_flags_cma ->
    gfp_to_alloc_flags_fast
  arm64: mte: Rework naming for tag manipulation functions
  arm64: mte: Rename __GFP_ZEROTAGS to __GFP_TAGGED
  mm: Add MIGRATE_METADATA allocation policy
  mm: Add memory statistics for the MIGRATE_METADATA allocation policy
  mm: page_alloc: Allocate from movable pcp lists only if
    ALLOC_FROM_METADATA
  mm: page_alloc: Bypass pcp when freeing MIGRATE_METADATA pages
  mm: compaction: Account for free metadata pages in
    __compact_finished()
  mm: compaction: Handle metadata pages as source for direct compaction
  mm: compaction: Do not use MIGRATE_METADATA to replace pages with
    metadata
  mm: migrate/mempolicy: Allocate metadata-enabled destination page
  mm: gup: Don't allow longterm pinning of MIGRATE_METADATA pages
  arm64: mte: Reserve tag storage memory
  arm64: mte: Expose tag storage pages to the MIGRATE_METADATA freelist
  arm64: mte: Make tag storage depend on ARCH_KEEP_MEMBLOCK
  arm64: mte: Move tag storage to MIGRATE_MOVABLE when MTE is disabled
  arm64: mte: Disable dynamic tag storage management if HW KASAN is
    enabled
  arm64: mte: Check that tag storage blocks are in the same zone
  mm: page_alloc: Manage metadata storage on page allocation
  mm: compaction: Reserve metadata storage in compaction_alloc()
  mm: khugepaged: Handle metadata-enabled VMAs
  mm: shmem: Allocate metadata storage for in-memory filesystems
  mm: Teach vma_alloc_folio() about metadata-enabled VMAs
  mm: page_alloc: Teach alloc_contig_range() about MIGRATE_METADATA
  arm64: mte: Manage tag storage on page allocation
  arm64: mte: Perform CMOs for tag blocks on tagged page allocation/free
  arm64: mte: Reserve tag block for the zero page
  mm: sched: Introduce PF_MEMALLOC_ISOLATE
  mm: arm64: Define the PAGE_METADATA_NONE page protection
  mm: mprotect: arm64: Set PAGE_METADATA_NONE for mprotect(PROT_MTE)
  mm: arm64: Set PAGE_METADATA_NONE in set_pte_at() if missing metadata
    storage
  mm: Call arch_swap_prepare_to_restore() before arch_swap_restore()
  arm64: mte: swap/copypage: Handle tag restoring when missing tag
    storage
  arm64: mte: Handle fatal signal in reserve_metadata_storage()
  mm: hugepage: Handle PAGE_METADATA_NONE faults for huge pages
  KVM: arm64: Disable MTE is tag storage is enabled
  arm64: mte: Enable tag storage management

 arch/arm64/Kconfig                       |  13 +
 arch/arm64/include/asm/assembler.h       |  10 +
 arch/arm64/include/asm/memory_metadata.h |  49 ++
 arch/arm64/include/asm/mte-def.h         |  16 +-
 arch/arm64/include/asm/mte.h             |  40 +-
 arch/arm64/include/asm/mte_tag_storage.h |  36 ++
 arch/arm64/include/asm/page.h            |   5 +-
 arch/arm64/include/asm/pgtable-prot.h    |   2 +
 arch/arm64/include/asm/pgtable.h         |  33 +-
 arch/arm64/kernel/Makefile               |   1 +
 arch/arm64/kernel/elfcore.c              |  14 +-
 arch/arm64/kernel/hibernate.c            |  46 +-
 arch/arm64/kernel/mte.c                  |  31 +-
 arch/arm64/kernel/mte_tag_storage.c      | 667 +++++++++++++++++++++++
 arch/arm64/kernel/setup.c                |   7 +
 arch/arm64/kvm/arm.c                     |   6 +-
 arch/arm64/lib/mte.S                     |  30 +-
 arch/arm64/mm/copypage.c                 |  26 +
 arch/arm64/mm/fault.c                    |  35 +-
 arch/arm64/mm/mteswap.c                  | 113 +++-
 fs/proc/meminfo.c                        |   8 +
 fs/proc/page.c                           |   1 +
 include/asm-generic/Kbuild               |   1 +
 include/asm-generic/memory_metadata.h    |  50 ++
 include/linux/gfp.h                      |  10 +
 include/linux/gfp_types.h                |  14 +-
 include/linux/huge_mm.h                  |   6 +
 include/linux/kernel-page-flags.h        |   1 +
 include/linux/migrate_mode.h             |   1 +
 include/linux/mm.h                       |  12 +-
 include/linux/mmzone.h                   |  26 +-
 include/linux/page-flags.h               |   1 +
 include/linux/pgtable.h                  |  19 +
 include/linux/sched.h                    |   2 +-
 include/linux/sched/mm.h                 |  13 +
 include/linux/vm_event_item.h            |   5 +
 include/linux/vmstat.h                   |   2 +
 include/trace/events/mmflags.h           |   5 +-
 mm/Kconfig                               |   5 +
 mm/compaction.c                          |  52 +-
 mm/huge_memory.c                         | 109 ++++
 mm/internal.h                            |   7 +
 mm/khugepaged.c                          |   7 +
 mm/memory.c                              | 180 +++++-
 mm/mempolicy.c                           |   7 +
 mm/migrate.c                             |   6 +
 mm/mm_init.c                             |  23 +-
 mm/mprotect.c                            |  46 ++
 mm/page_alloc.c                          | 136 ++++-
 mm/page_isolation.c                      |  19 +-
 mm/page_owner.c                          |   3 +-
 mm/shmem.c                               |  14 +-
 mm/show_mem.c                            |   4 +
 mm/swapfile.c                            |   4 +
 mm/vmscan.c                              |   3 +
 mm/vmstat.c                              |  13 +-
 56 files changed, 1834 insertions(+), 161 deletions(-)
 create mode 100644 arch/arm64/include/asm/memory_metadata.h
 create mode 100644 arch/arm64/include/asm/mte_tag_storage.h
 create mode 100644 arch/arm64/kernel/mte_tag_storage.c
 create mode 100644 include/asm-generic/memory_metadata.h

Comments

David Hildenbrand Aug. 24, 2023, 7:50 a.m. UTC | #1
On 23.08.23 15:13, Alexandru Elisei wrote:
> Introduction
> ============
> 
> Arm has implemented memory coloring in hardware, and the feature is called
> Memory Tagging Extensions (MTE). It works by embedding a 4 bit tag in bits
> 59..56 of a pointer, and storing this tag to a reserved memory location.
> When the pointer is dereferenced, the hardware compares the tag embedded in
> the pointer (logical tag) with the tag stored in memory (allocation tag).
> 
> The relation between memory and where the tag for that memory is stored is
> static.
> 
> The memory where the tags are stored has so far been inaccessible to Linux.
> This series aims to change that by adding support for using the tag storage
> memory as data memory only; tag storage memory cannot itself be tagged.
> 
> 
> Implementation
> ==============
> 
> The series is based on v6.5-rc3 with these two patches cherry picked:
> 
> - mm: Call arch_swap_restore() from unuse_pte():
> 
>      https://lore.kernel.org/all/20230523004312.1807357-3-pcc@google.com/
> 
> - arm64: mte: Simplify swap tag restoration logic:
> 
>      https://lore.kernel.org/all/20230523004312.1807357-4-pcc@google.com/
> 
> The above two patches are queued for the v6.6 merge window:
> 
>      https://lore.kernel.org/all/20230702123821.04e64ea2c04dd0fdc947bda3@linux-foundation.org/
> 
> The entire series, including the above patches, can be cloned with:
> 
> $ git clone https://gitlab.arm.com/linux-arm/linux-ae.git \
> 	-b arm-mte-dynamic-carveout-rfc-v1
> 
> On the arm64 architecture side, an extension is being worked on that will
> clarify how MTE tag storage reuse should behave. The extension will be
> made public soon.
> 
> On the Linux side, MTE tag storage reuse is accomplished with the
> following changes:
> 
> 1. The tag storage memory is exposed to the memory allocator as a new
> migratetype, MIGRATE_METADATA. It behaves similarly to MIGRATE_CMA, with
> the restriction that it cannot be used to allocate tagged memory (tag
> storage memory cannot be tagged). On tagged page allocation, the
> corresponding tag storage is reserved via alloc_contig_range().
> 
> 2. mprotect(PROT_MTE) is implemented by changing the pte prot to
> PAGE_METADATA_NONE. When the page is next accessed, a fault is taken and
> the corresponding tag storage is reserved.
> 
> 3. When the code tries to copy tags to a page which doesn't have the tag
> storage reserved, the tags are copied to an xarray and restored in
> set_pte_at(), when the page is eventually mapped with the tag storage
> reserved.

Hi!

after re-reading it 2 times, I still have no clue what your patch set is 
actually trying to achieve. Probably there is a way to describe how user 
space intends to interact with this feature, so to see which value this 
actually has for user space -- and if we are using the right APIs and 
allocators.

So some dummy questions / statements

1) Is this about repurposing the memory used to hold tags for a different 
purpose? Or what exactly is user space going to do with the PROT_MTE 
memory? The whole mprotect(PROT_MTE) approach might not be the right 
thing to do.

2) Why do we even have to involve the page allocator if this is some 
special-purpose memory? Repurposing the buddy when later using 
alloc_contig_range() either way feels wrong.


[...]

>   arch/arm64/Kconfig                       |  13 +
>   arch/arm64/include/asm/assembler.h       |  10 +
>   arch/arm64/include/asm/memory_metadata.h |  49 ++
>   arch/arm64/include/asm/mte-def.h         |  16 +-
>   arch/arm64/include/asm/mte.h             |  40 +-
>   arch/arm64/include/asm/mte_tag_storage.h |  36 ++
>   arch/arm64/include/asm/page.h            |   5 +-
>   arch/arm64/include/asm/pgtable-prot.h    |   2 +
>   arch/arm64/include/asm/pgtable.h         |  33 +-
>   arch/arm64/kernel/Makefile               |   1 +
>   arch/arm64/kernel/elfcore.c              |  14 +-
>   arch/arm64/kernel/hibernate.c            |  46 +-
>   arch/arm64/kernel/mte.c                  |  31 +-
>   arch/arm64/kernel/mte_tag_storage.c      | 667 +++++++++++++++++++++++
>   arch/arm64/kernel/setup.c                |   7 +
>   arch/arm64/kvm/arm.c                     |   6 +-
>   arch/arm64/lib/mte.S                     |  30 +-
>   arch/arm64/mm/copypage.c                 |  26 +
>   arch/arm64/mm/fault.c                    |  35 +-
>   arch/arm64/mm/mteswap.c                  | 113 +++-
>   fs/proc/meminfo.c                        |   8 +
>   fs/proc/page.c                           |   1 +
>   include/asm-generic/Kbuild               |   1 +
>   include/asm-generic/memory_metadata.h    |  50 ++
>   include/linux/gfp.h                      |  10 +
>   include/linux/gfp_types.h                |  14 +-
>   include/linux/huge_mm.h                  |   6 +
>   include/linux/kernel-page-flags.h        |   1 +
>   include/linux/migrate_mode.h             |   1 +
>   include/linux/mm.h                       |  12 +-
>   include/linux/mmzone.h                   |  26 +-
>   include/linux/page-flags.h               |   1 +
>   include/linux/pgtable.h                  |  19 +
>   include/linux/sched.h                    |   2 +-
>   include/linux/sched/mm.h                 |  13 +
>   include/linux/vm_event_item.h            |   5 +
>   include/linux/vmstat.h                   |   2 +
>   include/trace/events/mmflags.h           |   5 +-
>   mm/Kconfig                               |   5 +
>   mm/compaction.c                          |  52 +-
>   mm/huge_memory.c                         | 109 ++++
>   mm/internal.h                            |   7 +
>   mm/khugepaged.c                          |   7 +
>   mm/memory.c                              | 180 +++++-
>   mm/mempolicy.c                           |   7 +
>   mm/migrate.c                             |   6 +
>   mm/mm_init.c                             |  23 +-
>   mm/mprotect.c                            |  46 ++
>   mm/page_alloc.c                          | 136 ++++-
>   mm/page_isolation.c                      |  19 +-
>   mm/page_owner.c                          |   3 +-
>   mm/shmem.c                               |  14 +-
>   mm/show_mem.c                            |   4 +
>   mm/swapfile.c                            |   4 +
>   mm/vmscan.c                              |   3 +
>   mm/vmstat.c                              |  13 +-
>   56 files changed, 1834 insertions(+), 161 deletions(-)
>   create mode 100644 arch/arm64/include/asm/memory_metadata.h
>   create mode 100644 arch/arm64/include/asm/mte_tag_storage.h
>   create mode 100644 arch/arm64/kernel/mte_tag_storage.c
>   create mode 100644 include/asm-generic/memory_metadata.h

The core-mm changes don't look particularly appealing :)
Catalin Marinas Aug. 24, 2023, 10:44 a.m. UTC | #2
On Thu, Aug 24, 2023 at 09:50:32AM +0200, David Hildenbrand wrote:
> after re-reading it 2 times, I still have no clue what your patch set is
> actually trying to achieve. Probably there is a way to describe how user
> space intends to interact with this feature, so to see which value this
> actually has for user space -- and if we are using the right APIs and
> allocators.

I'll try with an alternative summary, hopefully it becomes clearer (I
think Alex is away until the end of the week, may not reply
immediately). If this still doesn't work, maybe we should try a
different implementation ;).

The way MTE is implemented currently is to have a static carve-out of
the DRAM to store the allocation tags (a.k.a. memory colour). This is
what we call the tag storage. Each 16 bytes have 4 bits of tags, so this
means 1/32 of the DRAM, roughly 3% used for the tag storage. This is
done transparently by the hardware/interconnect (with firmware setup)
and normally hidden from the OS. So a checked memory access to location
X generates a tag fetch from location Y in the carve-out and this tag is
compared with the bits 59:56 in the pointer. The correspondence from X
to Y is linear (subject to a minimum block size to deal with some
address interleaving). The software doesn't need to know about this
correspondence as we have specific instructions like STG/LDG to location
X that lead to a tag store/load to Y.
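
To make the arithmetic concrete, here is a small sketch. The 16-byte
granule and the 4-bit tag come from the description above; the macro names,
tag_storage_bytes() and tag_addr_for() are hypothetical, and the mapping
shown is the simplest possible linear one. It deliberately ignores the
minimum block size and any address interleaving, which real hardware and
firmware decide.

```c
#include <assert.h>
#include <stdint.h>

#define GRANULE_BYTES 16u /* bytes of data covered by one tag */
#define TAG_BITS      4u  /* bits per allocation tag */

/* 16 bytes * 8 bits / 4 bits = 32 bytes of data per byte of tag storage. */
#define DATA_PER_TAG_BYTE (GRANULE_BYTES * 8u / TAG_BITS)

/* Bytes of tag storage needed to cover `data_bytes` of tagged memory. */
static inline uint64_t tag_storage_bytes(uint64_t data_bytes)
{
	return data_bytes / DATA_PER_TAG_BYTE;
}

/*
 * Assumed linear map from a data address X to the address Y of the byte
 * holding its tag. data_base and tag_base are illustrative parameters.
 */
static inline uint64_t tag_addr_for(uint64_t x, uint64_t data_base,
				    uint64_t tag_base)
{
	assert(x >= data_base);
	return tag_base + (x - data_base) / DATA_PER_TAG_BYTE;
}
```

With these numbers, 2 GiB of tagged memory needs 64 MiB of tag storage
(1/32, about 3.1%), and a single 4 KiB page needs 128 bytes, so one 4 KiB
page of tag storage covers 32 data pages.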

Now, not all memory used by applications is tagged (mmap(PROT_MTE)).
For example, some large allocations may not use PROT_MTE at all or only
for the first and last page since initialising the tags takes time. The
side-effect is that of these 3% DRAM, only part, say 1% is effectively
used. Some people want the unused tag storage to be released for normal
data usage (i.e. give it to the kernel page allocator).

So the first complication is that a PROT_MTE page allocation at address
X will need to reserve the tag storage at location Y (and migrate any
data in that page if it is in use).

To make things worse, pages in the tag storage/carve-out range cannot
use PROT_MTE themselves on current hardware, so this adds the second
complication - a heterogeneous memory layout. The kernel needs to know
where to allocate a PROT_MTE page from or migrate a current page if it
becomes PROT_MTE (mprotect()) and the range it is in does not support
tagging.

Some other complications are arm64-specific like cache coherency between
tags and data accesses. There is a draft architecture spec which will be
released soon, detailing how the hardware behaves.

To your question about user APIs/ABIs, that's entirely transparent. As
with the current kernel (without this dynamic tag storage), a user only
needs to ask for PROT_MTE mappings to get tagged pages.

> So some dummy questions / statements
> 
> 1) Is this about repurposing the memory used to hold tags for a different
> purpose?

Yes. To allow part of this 3% to be used for data. It could even be the
whole 3% if no application is enabling MTE.

> Or what exactly is user space going to do with the PROT_MTE memory?
> The whole mprotect(PROT_MTE) approach might not be the right thing to do.

As I mentioned above, there's no difference to the user ABI. PROT_MTE
works as before with the kernel moving pages around as needed.

> 2) Why do we even have to involve the page allocator if this is some
> special-purpose memory? Repurposing the buddy when later using
> alloc_contig_range() either way feels wrong.

The aim here is to rebrand this special-purpose memory as a nearly
general-purpose one (bar the PROT_MTE restriction).

> The core-mm changes don't look particularly appealing :)

OTOH, it's a fun project to learn about the mm ;).

Our aim for now is to get some feedback from the mm community on whether
this special -> nearly general rebranding is acceptable together with
the introduction of a heterogeneous memory concept for the general
purpose page allocator.

There are some alternatives we looked at with a smaller mm impact but we
haven't prototyped them yet: (a) use the available tag storage as a
frontswap accelerator or (b) use it as a (compressed) ramdisk that can
be mounted as swap. The latter has the advantage of showing up in the
available total memory, keeps customers happy ;). Both options would
need some mm hooks when a PROT_MTE page gets allocated to release the
corresponding page in the tag storage range.
David Hildenbrand Aug. 24, 2023, 11:06 a.m. UTC | #3
On 24.08.23 12:44, Catalin Marinas wrote:
> On Thu, Aug 24, 2023 at 09:50:32AM +0200, David Hildenbrand wrote:
>> after re-reading it 2 times, I still have no clue what your patch set is
>> actually trying to achieve. Probably there is a way to describe how user
>> space intends to interact with this feature, so to see which value this
>> actually has for user space -- and if we are using the right APIs and
>> allocators.
> 
> I'll try with an alternative summary, hopefully it becomes clearer (I
> think Alex is away until the end of the week, may not reply
> immediately). If this still doesn't work, maybe we should try a
> different implementation ;).
> 
> The way MTE is implemented currently is to have a static carve-out of
> the DRAM to store the allocation tags (a.k.a. memory colour). This is
> what we call the tag storage. Each 16 bytes have 4 bits of tags, so this
> means 1/32 of the DRAM, roughly 3% used for the tag storage. This is
> done transparently by the hardware/interconnect (with firmware setup)
> and normally hidden from the OS. So a checked memory access to location
> X generates a tag fetch from location Y in the carve-out and this tag is
> compared with the bits 59:56 in the pointer. The correspondence from X
> to Y is linear (subject to a minimum block size to deal with some
> address interleaving). The software doesn't need to know about this
> correspondence as we have specific instructions like STG/LDG to location
> X that lead to a tag store/load to Y.
> 
> Now, not all memory used by applications is tagged (mmap(PROT_MTE)).
> For example, some large allocations may not use PROT_MTE at all or only
> for the first and last page since initialising the tags takes time. The
> side-effect is that of these 3% DRAM, only part, say 1% is effectively
> used. Some people want the unused tag storage to be released for normal
> data usage (i.e. give it to the kernel page allocator).
> 
> So the first complication is that a PROT_MTE page allocation at address
> X will need to reserve the tag storage at location Y (and migrate any
> data in that page if it is in use).
> 
> To make things worse, pages in the tag storage/carve-out range cannot
> use PROT_MTE themselves on current hardware, so this adds the second
> complication - a heterogeneous memory layout. The kernel needs to know
> where to allocate a PROT_MTE page from or migrate a current page if it
> becomes PROT_MTE (mprotect()) and the range it is in does not support
> tagging.
> 
> Some other complications are arm64-specific like cache coherency between
> tags and data accesses. There is a draft architecture spec which will be
> released soon, detailing how the hardware behaves.
> 
> To your question about user APIs/ABIs, that's entirely transparent. As
> with the current kernel (without this dynamic tag storage), a user only
> needs to ask for PROT_MTE mappings to get tagged pages.

Thanks, that clarifies things a lot.

So it sounds like you might want to provide that tag memory using CMA.

That way, only movable allocations can end up on that CMA memory area, 
and you can allocate selected tag pages on demand (similar to the 
alloc_contig_range() use case).

That also solves the issue that such tag memory must not be longterm-pinned.

Regarding one complication: "The kernel needs to know where to allocate 
a PROT_MTE page from or migrate a current page if it becomes PROT_MTE 
(mprotect()) and the range it is in does not support tagging.", 
simplified handling would be if it's in a MIGRATE_CMA pageblock, it 
doesn't support tagging. You have to migrate to a !CMA page (for 
example, not specifying GFP_MOVABLE as a quick way to achieve that).

(I have no idea how tag/tagged memory interacts with memory hotplug, I 
assume it just doesn't work)

> 
>> So some dummy questions / statements
>>
>> 1) Is this about repurposing the memory used to hold tags for a different
>> purpose?
> 
> Yes. To allow part of this 3% to be used for data. It could even be the
> whole 3% if no application is enabling MTE.
> 
>> Or what exactly is user space going to do with the PROT_MTE memory?
>> The whole mprotect(PROT_MTE) approach might not be the right thing to do.
> 
> As I mentioned above, there's no difference to the user ABI. PROT_MTE
> works as before with the kernel moving pages around as needed.
> 
>> 2) Why do we even have to involve the page allocator if this is some
>> special-purpose memory? Repurposing the buddy when later using
>> alloc_contig_range() either way feels wrong.
> 
> The aim here is to rebrand this special-purpose memory as a nearly
> general-purpose one (bar the PROT_MTE restriction).
> 
>> The core-mm changes don't look particularly appealing :)
> 
> OTOH, it's a fun project to learn about the mm ;).
> 
> Our aim for now is to get some feedback from the mm community on whether
> this special -> nearly general rebranding is acceptable together with
> the introduction of a heterogeneous memory concept for the general
> purpose page allocator.
> 
> There are some alternatives we looked at with a smaller mm impact but we
> haven't prototyped them yet: (a) use the available tag storage as a
> frontswap accelerator or (b) use it as a (compressed) ramdisk that can

Frontswap is no more :)

> be mounted as swap. The latter has the advantage of showing up in the
> available total memory, keeps customers happy ;). Both options would
> need some mm hooks when a PROT_MTE page gets allocated to release the
> corresponding page in the tag storage range.

Yes, some way of MM integration would be required. If CMA could get the 
job done, you might get most of what you need already.
David Hildenbrand Aug. 24, 2023, 11:25 a.m. UTC | #4
On 24.08.23 13:06, David Hildenbrand wrote:
> On 24.08.23 12:44, Catalin Marinas wrote:
>> On Thu, Aug 24, 2023 at 09:50:32AM +0200, David Hildenbrand wrote:
>>> after re-reading it 2 times, I still have no clue what your patch set is
>>> actually trying to achieve. Probably there is a way to describe how user
>>> space intends to interact with this feature, so to see which value this
>>> actually has for user space -- and if we are using the right APIs and
>>> allocators.
>>
>> I'll try with an alternative summary, hopefully it becomes clearer (I
>> think Alex is away until the end of the week, may not reply
>> immediately). If this still doesn't work, maybe we should try a
>> different implementation ;).
>>
>> The way MTE is implemented currently is to have a static carve-out of
>> the DRAM to store the allocation tags (a.k.a. memory colour). This is
>> what we call the tag storage. Each 16 bytes have 4 bits of tags, so this
>> means 1/32 of the DRAM, roughly 3% used for the tag storage. This is
>> done transparently by the hardware/interconnect (with firmware setup)
>> and normally hidden from the OS. So a checked memory access to location
>> X generates a tag fetch from location Y in the carve-out and this tag is
>> compared with the bits 59:56 in the pointer. The correspondence from X
>> to Y is linear (subject to a minimum block size to deal with some
>> address interleaving). The software doesn't need to know about this
>> correspondence as we have specific instructions like STG/LDG to location
>> X that lead to a tag store/load to Y.
>>
>> Now, not all memory used by applications is tagged (mmap(PROT_MTE)).
>> For example, some large allocations may not use PROT_MTE at all or only
>> for the first and last page since initialising the tags takes time. The
>> side-effect is that of these 3% DRAM, only part, say 1% is effectively
>> used. Some people want the unused tag storage to be released for normal
>> data usage (i.e. give it to the kernel page allocator).
>>
>> So the first complication is that a PROT_MTE page allocation at address
>> X will need to reserve the tag storage at location Y (and migrate any
>> data in that page if it is in use).
>>
>> To make things worse, pages in the tag storage/carve-out range cannot
>> use PROT_MTE themselves on current hardware, so this adds the second
>> complication - a heterogeneous memory layout. The kernel needs to know
>> where to allocate a PROT_MTE page from or migrate a current page if it
>> becomes PROT_MTE (mprotect()) and the range it is in does not support
>> tagging.
>>
>> Some other complications are arm64-specific like cache coherency between
>> tags and data accesses. There is a draft architecture spec which will be
>> released soon, detailing how the hardware behaves.
>>
>> To your question about user APIs/ABIs, that's entirely transparent. As
>> with the current kernel (without this dynamic tag storage), a user only
>> needs to ask for PROT_MTE mappings to get tagged pages.
> 
> Thanks, that clarifies things a lot.
> 
> So it sounds like you might want to provide that tag memory using CMA.
> 
> That way, only movable allocations can end up on that CMA memory area,
> and you can allocate selected tag pages on demand (similar to the
> alloc_contig_range() use case).
> 
> That also solves the issue that such tag memory must not be longterm-pinned.
> 
> Regarding one complication: "The kernel needs to know where to allocate
> a PROT_MTE page from or migrate a current page if it becomes PROT_MTE
> (mprotect()) and the range it is in does not support tagging.",
> simplified handling would be if it's in a MIGRATE_CMA pageblock, it
> doesn't support tagging. You have to migrate to a !CMA page (for
> example, not specifying GFP_MOVABLE as a quick way to achieve that).
> 

Okay, I now realize that this patch set effectively duplicates some CMA 
behavior using a new migrate-type. Yeah, that's probably not what we 
want just to identify if memory is taggable or not.

Maybe there is a way to just keep reusing most of CMA instead.


Another simpler idea to get started would be to just intercept the first 
PROT_MTE, and allocate all CMA memory. In that case, systems that don't 
ever use PROT_MTE can have that additional 3% of memory.

You probably know better how frequent it is that only a handful of 
applications use PROT_MTE, such that there is still a significant 
portion of tag memory to be reused (and if it's really worth optimizing 
for that scenario).
Catalin Marinas Aug. 24, 2023, 3:24 p.m. UTC | #5
On Thu, Aug 24, 2023 at 01:25:41PM +0200, David Hildenbrand wrote:
> On 24.08.23 13:06, David Hildenbrand wrote:
> > On 24.08.23 12:44, Catalin Marinas wrote:
> > > The way MTE is implemented currently is to have a static carve-out of
> > > the DRAM to store the allocation tags (a.k.a. memory colour). This is
> > > what we call the tag storage. Each 16 bytes have 4 bits of tags, so this
> > > means 1/32 of the DRAM, roughly 3% used for the tag storage. This is
> > > done transparently by the hardware/interconnect (with firmware setup)
> > > and normally hidden from the OS. So a checked memory access to location
> > > X generates a tag fetch from location Y in the carve-out and this tag is
> > > compared with the bits 59:56 in the pointer. The correspondence from X
> > > to Y is linear (subject to a minimum block size to deal with some
> > > address interleaving). The software doesn't need to know about this
> > > correspondence as we have specific instructions like STG/LDG to location
> > > X that lead to a tag store/load to Y.
> > > 
> > > Now, not all memory used by applications is tagged (mmap(PROT_MTE)).
> > > For example, some large allocations may not use PROT_MTE at all or only
> > > for the first and last page since initialising the tags takes time. The
> > > side-effect is that of these 3% DRAM, only part, say 1% is effectively
> > > used. Some people want the unused tag storage to be released for normal
> > > data usage (i.e. give it to the kernel page allocator).
[...]
> > So it sounds like you might want to provide that tag memory using CMA.
> > 
> > That way, only movable allocations can end up on that CMA memory area,
> > and you can allocate selected tag pages on demand (similar to the
> > alloc_contig_range() use case).
> > 
> > That also solves the issue that such tag memory must not be longterm-pinned.
> > 
> > Regarding one complication: "The kernel needs to know where to allocate
> > a PROT_MTE page from or migrate a current page if it becomes PROT_MTE
> > (mprotect()) and the range it is in does not support tagging.",
> > simplified handling would be if it's in a MIGRATE_CMA pageblock, it
> > doesn't support tagging. You have to migrate to a !CMA page (for
> > example, not specifying GFP_MOVABLE as a quick way to achieve that).
> 
> Okay, I now realize that this patch set effectively duplicates some CMA
> behavior using a new migrate-type.

Yes, pretty much, with some additional hooks to trigger migration. The
CMA mechanism was a great source of inspiration.

In addition, there are some races that are addressed, mostly around page
migration/copying: the source page is untagged, the destination is
allocated as untagged, but before the copy an mprotect() makes the source
tagged (PG_mte_tagged set), leaving the copy_highpage() mechanism with
nowhere to store the tags.

> Yeah, that's probably not what we want just to identify if memory is
> taggable or not.
> 
> Maybe there is a way to just keep reusing most of CMA instead.

A potential issue is that devices (mobile phones) may need a different
CMA range as well for DMA (and not necessarily in ZONE_DMA). Can
free_area[MIGRATE_CMA] handle multiple disjoint ranges? I don't see why
not as it's just a list.

We (Google and Arm) went through a few rounds of discussions and
prototyping trying to find the best approach: (1) a separate free_area[]
array in each zone (early proof of concept from Peter C and Evgenii S,
https://github.com/google/sanitizers/tree/master/mte-dynamic-carveout),
(2) a new ZONE_METADATA, (3) a separate CPU-less NUMA node just for the
tag storage, (4) a new MIGRATE_METADATA type.

We settled on the last option as it closely resembles CMA without interfering
with it. I don't remember why we did not just go for MIGRATE_CMA, it may
have been the heterogeneous memory aspect and the fact that we don't
want PROT_MTE (VM_MTE) allocations from this range. If the hardware
allowed this, I think the patches would have been a bit simpler.

Alex can comment more next week on how we ended up with this choice but
if we find a way to avoid VM_MTE allocations from certain areas, I think
we can reuse the CMA infrastructure. A bigger hammer would be no VM_MTE
allocations from any CMA range but it seems too restrictive.

> Another simpler idea to get started would be to just intercept the first
> PROT_MTE, and allocate all CMA memory. In that case, systems that don't ever
> use PROT_MTE can have that additional 3% of memory.

We had this on the table as well but the most likely deployment, at
least initially, is only some secure services enabling MTE with various
apps gradually moving towards this in time. So that's why the main
pushback from vendors is having this 3% reserved permanently. Even if
all apps use MTE, only the anonymous mappings are PROT_MTE, so still not
fully using the tag storage.
Alexandru Elisei Sept. 6, 2023, 11:23 a.m. UTC | #6
Hi,

Thank you for the feedback!

Catalin did a great job explaining what this patch series does; I'll add my
own comments on top of his.

On Thu, Aug 24, 2023 at 04:24:30PM +0100, Catalin Marinas wrote:
> On Thu, Aug 24, 2023 at 01:25:41PM +0200, David Hildenbrand wrote:
> > On 24.08.23 13:06, David Hildenbrand wrote:
> > > On 24.08.23 12:44, Catalin Marinas wrote:
> > > > The way MTE is implemented currently is to have a static carve-out of
> > > > the DRAM to store the allocation tags (a.k.a. memory colour). This is
> > > > what we call the tag storage. Each 16 bytes have 4 bits of tags, so this
> > > > means 1/32 of the DRAM, roughly 3% used for the tag storage. This is
> > > > done transparently by the hardware/interconnect (with firmware setup)
> > > > and normally hidden from the OS. So a checked memory access to location
> > > > X generates a tag fetch from location Y in the carve-out and this tag is
> > > > compared with the bits 59:56 in the pointer. The correspondence from X
> > > > to Y is linear (subject to a minimum block size to deal with some
> > > > address interleaving). The software doesn't need to know about this
> > > > correspondence as we have specific instructions like STG/LDG to location
> > > > X that lead to a tag store/load to Y.
> > > > 
> > > > Now, not all memory used by applications is tagged (mmap(PROT_MTE)).
> > > > For example, some large allocations may not use PROT_MTE at all or only
> > > > for the first and last page since initialising the tags takes time. The
> > > > side-effect is that of these 3% DRAM, only part, say 1% is effectively
> > > > used. Some people want the unused tag storage to be released for normal
> > > > data usage (i.e. give it to the kernel page allocator).
> [...]
> > > So it sounds like you might want to provide that tag memory using CMA.
> > > 
> > > That way, only movable allocations can end up on that CMA memory area,
> > > and you can allocate selected tag pages on demand (similar to the
> > > alloc_contig_range() use case).
> > > 
> > > That also solves the issue that such tag memory must not be longterm-pinned.
> > > 
> > > Regarding one complication: "The kernel needs to know where to allocate
> > > a PROT_MTE page from or migrate a current page if it becomes PROT_MTE
> > > (mprotect()) and the range it is in does not support tagging.",
> > > simplified handling would be if it's in a MIGRATE_CMA pageblock, it
> > > doesn't support tagging. You have to migrate to a !CMA page (for
> > > example, not specifying GFP_MOVABLE as a quick way to achieve that).
> > 
> > Okay, I now realize that this patch set effectively duplicates some CMA
> > behavior using a new migrate-type.
> 
> Yes, pretty much, with some additional hooks to trigger migration. The
> CMA mechanism was a great source of inspiration.
> 
> In addition, there are some races that are addressed mostly around page
> migration/copying: the source page is untagged, the destination
> allocated as untagged but before the copy an mprotect() makes the source
> tagged (PG_mte_tagged set) and the copy_highpage() mechanism not having
> anywhere to store the tags.
> 
> > Yeah, that's probably not what we want just to identify if memory is
> > taggable or not.
> > 
> > Maybe there is a way to just keep reusing most of CMA instead.
> 
> A potential issue is that devices (mobile phones) may need a different
> CMA range as well for DMA (and not necessarily in ZONE_DMA). Can
> free_area[MIGRATE_CMA] handle multiple disjoint ranges? I don't see why
> not as it's just a list.

I don't think that's a problem either. Today the user can specify multiple
CMA ranges on the kernel command line (via "cma", "hugetlb_cma", etc). CMA
already has the mechanism to keep track of multiple regions - it stores them
in the cma_areas array.

> 
> We (Google and Arm) went through a few rounds of discussions and
> prototyping trying to find the best approach: (1) a separate free_area[]
> array in each zone (early proof of concept from Peter C and Evgenii S,
> https://github.com/google/sanitizers/tree/master/mte-dynamic-carveout),
> (2) a new ZONE_METADATA, (3) a separate CPU-less NUMA node just for the
> tag storage, (4) a new MIGRATE_METADATA type.
> 
> We settled on the latter as it closely resembles CMA without interfering
> with it. I don't remember why we did not just go for MIGRATE_CMA, it may
> have been the heterogeneous memory aspect and the fact that we don't
> want PROT_MTE (VM_MTE) allocations from this range. If the hardware
> allowed this, I think the patches would have been a bit simpler.

You are correct, we settled on a new migrate type because the tag storage
memory is fundamentally a different memory type with different properties
than the rest of the memory in the system: tag storage memory cannot be
tagged, MIGRATE_CMA memory can be tagged.

> 
> Alex can comment more next week on how we ended up with this choice but
> if we find a way to avoid VM_MTE allocations from certain areas, I think
> we can reuse the CMA infrastructure. A bigger hammer would be no VM_MTE
> allocations from any CMA range but it seems too restrictive.

I considered mixing the tag storage memory with normal memory and
adding it to MIGRATE_CMA. But since tag storage memory cannot be tagged,
this means that it's not enough anymore to have a __GFP_MOVABLE allocation
request to use MIGRATE_CMA.

I considered two solutions to this problem:

1. Only allocate from MIGRATE_CMA if the requested memory is not tagged =>
this effectively means transforming all memory from MIGRATE_CMA into the
MIGRATE_METADATA migratetype that the series introduces. Not very
appealing, because that means treating normal memory that is also on the
MIGRATE_CMA lists as tagged memory.

2. Keep track of which pages are tag storage at page granularity (either by
a page flag, or by checking that the pfn falls in one of the tag storage
regions, or by some other mechanism). When the page allocator takes free
pages from the MIGRATE_METADATA list to satisfy an allocation, compare the
gfp mask with the page type, and if the allocation is tagged and the page
is a tag storage page, put it back at the tail of the free list and choose
the next page. Repeat until the page allocator finds a normal memory page
that can be tagged (some refinements are obviously needed to avoid
infinite loops).

I considered solution 2 to be more complicated than keeping track of tag
storage pages at the migratetype level. Conceptually, keeping two distinct
memory types on separate migrate types looked to me like the cleaner and
simpler solution.
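
To make solution 2 concrete, here is a minimal userspace sketch of the
"skip tag storage pages" loop, with the scan bounded by the list length so
it cannot spin forever. Everything in it is hypothetical and only for
illustration: struct tag_range, struct free_list, pfn_is_tag_storage() and
pick_free_pfn() are not kernel interfaces, and the free list is modelled as
a full ring buffer of pfns rather than the real buddy free lists.

```c
#include <stdbool.h>
#include <stddef.h>

#define NO_PFN (~0UL)

/* Hypothetical: one tag storage region, as the pfn range [start, end). */
struct tag_range {
	unsigned long start_pfn;
	unsigned long end_pfn;
};

/* Hypothetical: a free list modelled as a full ring buffer of pfns. */
struct free_list {
	unsigned long *pfns;
	size_t nr;	/* number of entries in the ring */
	size_t head;	/* index of the current head of the list */
};

/* "Checking that the pfn falls in one of the tag storage regions". */
static inline bool pfn_is_tag_storage(const struct tag_range *ranges,
				      size_t nr_ranges, unsigned long pfn)
{
	for (size_t i = 0; i < nr_ranges; i++)
		if (pfn >= ranges[i].start_pfn && pfn < ranges[i].end_pfn)
			return true;
	return false;
}

/*
 * Return the pfn at the head of the list that can satisfy the request,
 * or NO_PFN. For a tagged request, tag storage pages are rotated to the
 * tail. At most fl->nr entries are examined, which bounds the loop.
 * The chosen page is not removed from the list in this sketch.
 */
static inline unsigned long pick_free_pfn(struct free_list *fl,
					  const struct tag_range *ranges,
					  size_t nr_ranges, bool want_tagged)
{
	for (size_t tries = 0; tries < fl->nr; tries++) {
		unsigned long pfn = fl->pfns[fl->head];

		if (!want_tagged ||
		    !pfn_is_tag_storage(ranges, nr_ranges, pfn))
			return pfn;

		/* Head becomes the tail of a full ring: advance the head. */
		fl->head = (fl->head + 1) % fl->nr;
	}
	return NO_PFN;
}
```

Even with the bound, a tagged request may walk the whole list before it
succeeds or fails, so the cost grows with the number of tag storage pages
sitting on that list.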

Maybe I missed something, I'm definitely open to suggestions regarding
putting the tag storage pages on MIGRATE_CMA (or another migratetype) if
that's a better approach.

Might be worth pointing out that putting the tag storage memory on the
MIGRATE_CMA migratetype only changes how the page allocator allocates
pages; all the other changes to migration/compaction/mprotect/etc will
still be there, because they are needed not because of how the tag storage
memory is represented by the page allocator, but because tag storage memory
cannot be tagged, and regular memory can.

Thanks,
Alex

> 
> > Another simpler idea to get started would be to just intercept the first
> > PROT_MTE, and allocate all CMA memory. In that case, systems that don't ever
> > use PROT_MTE can have that additional 3% of memory.
> 
> We had this on the table as well but the most likely deployment, at
> least initially, is only some secure services enabling MTE with various
> apps gradually moving towards this in time. So that's why the main
> pushback from vendors is having this 3% reserved permanently. Even if
> all apps use MTE, only the anonymous mappings are PROT_MTE, so still not
> fully using the tag storage.
> 
> -- 
> Catalin
>
Catalin Marinas Sept. 11, 2023, 11:52 a.m. UTC | #7
On Wed, Sep 06, 2023 at 12:23:21PM +0100, Alexandru Elisei wrote:
> On Thu, Aug 24, 2023 at 04:24:30PM +0100, Catalin Marinas wrote:
> > On Thu, Aug 24, 2023 at 01:25:41PM +0200, David Hildenbrand wrote:
> > > On 24.08.23 13:06, David Hildenbrand wrote:
> > > > Regarding one complication: "The kernel needs to know where to allocate
> > > > a PROT_MTE page from or migrate a current page if it becomes PROT_MTE
> > > > (mprotect()) and the range it is in does not support tagging.",
> > > > simplified handling would be if it's in a MIGRATE_CMA pageblock, it
> > > > doesn't support tagging. You have to migrate to a !CMA page (for
> > > > example, not specifying GFP_MOVABLE as a quick way to achieve that).
> > > 
> > > Okay, I now realize that this patch set effectively duplicates some CMA
> > > behavior using a new migrate-type.
[...]
> I considered mixing the tag storage memory memory with normal memory and
> adding it to MIGRATE_CMA. But since tag storage memory cannot be tagged,
> this means that it's not enough anymore to have a __GFP_MOVABLE allocation
> request to use MIGRATE_CMA.
> 
> I considered two solutions to this problem:
> 
> 1. Only allocate from MIGRATE_CMA is the requested memory is not tagged =>
> this effectively means transforming all memory from MIGRATE_CMA into the
> MIGRATE_METADATA migratetype that the series introduces. Not very
> appealing, because that means treating normal memory that is also on the
> MIGRATE_CMA lists as tagged memory.

That's indeed not ideal. We could try this if it makes the patches
significantly simpler, though I'm not so sure.

Allocating metadata is the easier part as we know the correspondence
from the tagged pages (32 PROT_MTE pages) to the metadata page (1 tag
storage page), so alloc_contig_range() does this for us. Just adding it
to the CMA range is sufficient.

However, making sure that we don't allocate PROT_MTE pages from the
metadata range is what led us to another migrate type. I guess we could
achieve something similar with a new zone or a CPU-less NUMA node,
though the latter is not guaranteed not to allocate memory from the
range, only make it less likely. Both these options are less flexible in
terms of size/alignment/placement.

Maybe as a quick hack - only allow PROT_MTE from ZONE_NORMAL and
configure the metadata range in ZONE_MOVABLE but at some point I'd
expect some CXL-attached memory to support MTE with additional carveout
reserved.

To recap, in this series, a PROT_MTE page allocation starts with a
typical allocation from anywhere other than MIGRATE_METADATA followed by
the hooks to reserve the corresponding metadata range at (pfn * 128 +
offset) for a 4K page. The whole metadata page is reserved, so the
adjacent 31 pages around the original allocation can also be mapped as
PROT_MTE.
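
The arithmetic behind "pfn * 128 + offset" and "adjacent 31 pages" can be
sketched as follows. This is an illustration only, assuming 4K pages, a
purely linear mapping and a page-aligned tag storage base; it ignores the
minimum block size used for address interleaving. struct tag_region and
the helper names are hypothetical and do not come from the series.

```c
#include <stdint.h>

#define PAGE_SHIFT		12	/* 4K pages */
#define TAG_GRANULE_SHIFT	4	/* one 4-bit tag per 16 bytes */

/* Tag bytes per data page: (4096 / 16) tags * 4 bits = 128 bytes. */
#define TAG_BYTES_PER_PAGE	((1u << (PAGE_SHIFT - TAG_GRANULE_SHIFT)) / 2)

/* Data pages whose tags share one tag storage page: 4096 / 128 = 32. */
#define PAGES_PER_TAG_PAGE	((1u << PAGE_SHIFT) / TAG_BYTES_PER_PAGE)

/* Hypothetical description of one data region and its tag storage. */
struct tag_region {
	uint64_t data_start_pfn;	/* first pfn of the data region */
	uint64_t tag_base;		/* physical address of its tags */
};

/* Physical address of the 128 bytes of tags for data page @pfn. */
static inline uint64_t tag_storage_addr(const struct tag_region *r,
					uint64_t pfn)
{
	return r->tag_base +
	       (pfn - r->data_start_pfn) * TAG_BYTES_PER_PAGE;
}

/* The tag storage page that has to be reserved for data page @pfn. */
static inline uint64_t tag_storage_pfn(const struct tag_region *r,
				       uint64_t pfn)
{
	return tag_storage_addr(r, pfn) >> PAGE_SHIFT;
}

/* First of the 32 data pages covered by the same tag storage page. */
static inline uint64_t first_pfn_sharing_tag_page(const struct tag_region *r,
						  uint64_t pfn)
{
	uint64_t off = pfn - r->data_start_pfn;

	return r->data_start_pfn +
	       (off & ~((uint64_t)PAGES_PER_TAG_PAGE - 1));
}
```

Here the "offset" term of the recap is tag_base - data_start_pfn * 128.
Once one tag storage page is reserved, all 32 data pages starting at
first_pfn_sharing_tag_page() can be mapped as PROT_MTE without reserving
anything else.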

(Peter and Evgenii @ Google had a slightly different approach in their
prototype: separate free_area[] array for PROT_MTE pages; while it has
some advantages, I found it more intrusive since the same page can be on
a free_area/free_list or another)

> 2. Keep track of which pages are tag storage at page granularity (either by
> a page flag, or by checking that the pfn falls in one of the tag storage
> region, or by some other mechanism). When the page allocator takes free
> pages from the MIGRATE_METADATA list to satisfy an allocation, compare the
> gfp mask with the page type, and if the allocation is tagged and the page
> is a tag storage page, put it back at the tail of the free list and choose
> the next page. Repeat until the page allocator finds a normal memory page
> that can be tagged (some refinements obviously needed to need to avoid
> infinite loops).

With large enough CMA areas, there's a real risk of latency spikes, RCU
stalls etc. Not really keen on such heuristics.
David Hildenbrand Sept. 11, 2023, 12:29 p.m. UTC | #8
On 11.09.23 13:52, Catalin Marinas wrote:
> On Wed, Sep 06, 2023 at 12:23:21PM +0100, Alexandru Elisei wrote:
>> On Thu, Aug 24, 2023 at 04:24:30PM +0100, Catalin Marinas wrote:
>>> On Thu, Aug 24, 2023 at 01:25:41PM +0200, David Hildenbrand wrote:
>>>> On 24.08.23 13:06, David Hildenbrand wrote:
>>>>> Regarding one complication: "The kernel needs to know where to allocate
>>>>> a PROT_MTE page from or migrate a current page if it becomes PROT_MTE
>>>>> (mprotect()) and the range it is in does not support tagging.",
>>>>> simplified handling would be if it's in a MIGRATE_CMA pageblock, it
>>>>> doesn't support tagging. You have to migrate to a !CMA page (for
>>>>> example, not specifying GFP_MOVABLE as a quick way to achieve that).
>>>>
>>>> Okay, I now realize that this patch set effectively duplicates some CMA
>>>> behavior using a new migrate-type.
> [...]
>> I considered mixing the tag storage memory memory with normal memory and
>> adding it to MIGRATE_CMA. But since tag storage memory cannot be tagged,
>> this means that it's not enough anymore to have a __GFP_MOVABLE allocation
>> request to use MIGRATE_CMA.
>>
>> I considered two solutions to this problem:
>>
>> 1. Only allocate from MIGRATE_CMA is the requested memory is not tagged =>
>> this effectively means transforming all memory from MIGRATE_CMA into the
>> MIGRATE_METADATA migratetype that the series introduces. Not very
>> appealing, because that means treating normal memory that is also on the
>> MIGRATE_CMA lists as tagged memory.
> 
> That's indeed not ideal. We could try this if it makes the patches
> significantly simpler, though I'm not so sure.
> 
> Allocating metadata is the easier part as we know the correspondence
> from the tagged pages (32 PROT_MTE page) to the metadata page (1 tag
> storage page), so alloc_contig_range() does this for us. Just adding it
> to the CMA range is sufficient.
> 
> However, making sure that we don't allocate PROT_MTE pages from the
> metadata range is what led us to another migrate type. I guess we could
> achieve something similar with a new zone or a CPU-less NUMA node,

Ideally, no significant core-mm changes to optimize for an architecture 
oddity. That implies, no new zones and no new migratetypes -- unless it 
is unavoidable and you are confident that you can convince core-MM 
people that the use case (giving back 3% of system RAM at max in some 
setups) is worth the trouble.

I also had CPU-less NUMA nodes in mind when thinking about that, but not 
sure how easy it would be to integrate it. If the tag memory has 
actually different performance characteristics as well, a NUMA node 
would be the right choice.

If we could find some way to easily support this either via CMA or 
CPU-less NUMA nodes, that would be much preferable; even if we cannot 
cover each and every future use case right now. I expect some issues 
with CXL+MTE either way, but am happy to be taught otherwise :)


Another thought I had was adding something like CMA memory 
characteristics. Like, asking if a given CMA area/page supports tagging 
(i.e., flag for the CMA area set?)?

When you need memory that supports tagging and have a page that does not 
support tagging (CMA && !taggable), simply migrate to !MOVABLE memory 
(eventually we could also try adding !CMA).

Was that discussed and what would be the challenges with that? Page 
migration due to compaction comes to mind, but it might also be easy to 
handle if we can just avoid CMA memory for that.
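
A rough sketch of that per-area characteristic, to show the shape of the
check. All names here are hypothetical: struct cma_area_sketch is not the
kernel's struct cma, and the "taggable" flag is only the idea floated
above, not an existing interface.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical CMA area descriptor carrying a "supports tagging" flag. */
struct cma_area_sketch {
	unsigned long base_pfn;
	unsigned long count;	/* number of pages in the area */
	bool taggable;		/* false for tag storage memory */
};

enum placement {
	PLACE_KEEP,	/* the page can stay where it is */
	PLACE_MIGRATE,	/* the page must move to taggable memory */
};

/* Return the area containing @pfn, or NULL if it is not CMA memory. */
static inline const struct cma_area_sketch *
find_cma_area(const struct cma_area_sketch *areas, size_t nr,
	      unsigned long pfn)
{
	for (size_t i = 0; i < nr; i++)
		if (pfn >= areas[i].base_pfn &&
		    pfn - areas[i].base_pfn < areas[i].count)
			return &areas[i];
	return NULL;
}

/* Only (CMA && !taggable) pages have to move when tags are needed. */
static inline enum placement
placement_for_page(const struct cma_area_sketch *areas, size_t nr,
		   unsigned long pfn, bool needs_tags)
{
	const struct cma_area_sketch *area;

	if (!needs_tags)
		return PLACE_KEEP;

	area = find_cma_area(areas, nr, pfn);
	if (area && !area->taggable)
		return PLACE_MIGRATE;

	return PLACE_KEEP;
}
```

With a flag like this, an ordinary CMA area used for DMA keeps taggable
set and is unaffected; only the area backed by tag storage forces a
migration.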

> though the latter is not guaranteed not to allocate memory from the
> range, only make it less likely. Both these options are less flexible in
> terms of size/alignment/placement.
> 
> Maybe as a quick hack - only allow PROT_MTE from ZONE_NORMAL and
> configure the metadata range in ZONE_MOVABLE but at some point I'd
> expect some CXL-attached memory to support MTE with additional carveout
> reserved.

I have no idea how we could possibly cleanly support memory hotplug in 
virtual environments (virtual DIMMs, virtio-mem) with MTE. In contrast 
to s390x storage keys, the approach that arm64 with MTE took here 
(exposing tag memory to the VM) makes it rather hard and complicated.
Kuan-Ying Lee (李冠穎) Sept. 13, 2023, 8:11 a.m. UTC | #9
On Wed, 2023-08-23 at 14:13 +0100, Alexandru Elisei wrote:
> Introduction
> ============
> 
> Arm has implemented memory coloring in hardware, and the feature is
> called
> Memory Tagging Extensions (MTE). It works by embedding a 4 bit tag in
> bits
> 59..56 of a pointer, and storing this tag to a reserved memory
> location.
> When the pointer is dereferenced, the hardware compares the tag
> embedded in
> the pointer (logical tag) with the tag stored in memory (allocation
> tag).
> 
> The relation between memory and where the tag for that memory is
> stored is
> static.
> 
> The memory where the tags are stored have been so far unaccessible to
> Linux.
> This series aims to change that, by adding support for using the tag
> storage
> memory only as data memory; tag storage memory cannot be itself
> tagged.
> 
> 
> Implementation
> ==============
> 
> The series is based on v6.5-rc3 with these two patches cherry picked:
> 
> - mm: Call arch_swap_restore() from unuse_pte():
> 
>     
> https://lore.kernel.org/all/20230523004312.1807357-3-pcc@google.com/
> 
> - arm64: mte: Simplify swap tag restoration logic:
> 
>     
> https://lore.kernel.org/all/20230523004312.1807357-4-pcc@google.com/
> 
> The above two patches are queued for the v6.6 merge window:
> 
>     
> https://lore.kernel.org/all/20230702123821.04e64ea2c04dd0fdc947bda3@linux-foundation.org/
> 
> The entire series, including the above patches, can be cloned with:
> 
> $ git clone https://gitlab.arm.com/linux-arm/linux-ae.git \
> 	-b arm-mte-dynamic-carveout-rfc-v1
> 
> On the arm64 architecture side, an extension is being worked on that
> will
> clarify how MTE tag storage reuse should behave. The extension will
> be
> made public soon.
> 
> On the Linux side, MTE tag storage reuse is accomplished with the
> following changes:
> 
> 1. The tag storage memory is exposed to the memory allocator as a new
> migratetype, MIGRATE_METADATA. It behaves similarly to MIGRATE_CMA,
> with
> the restriction that it cannot be used to allocate tagged memory (tag
> storage memory cannot be tagged). On tagged page allocation, the
> corresponding tag storage is reserved via alloc_contig_range().
> 
> 2. mprotect(PROT_MTE) is implemented by changing the pte prot to
> PAGE_METADATA_NONE. When the page is next accessed, a fault is taken
> and
> the corresponding tag storage is reserved.
> 
> 3. When the code tries to copy tags to a page which doesn't have the
> tag
> storage reserved, the tags are copied to an xarray and restored in
> set_pte_at(), when the page is eventually mapped with the tag storage
> reserved.
> 
> KVM support has not been implemented yet, that because a non-MTE
> enabled VMA
> can back the memory of an MTE-enabled VM. After there is a consensus
> on the
> right approach on the memory management support, I will add it.
> 
> Explanations for the last two changes follow. The gist of it is that
> they
> were added mostly because of races, and it my intention to make the
> code
> more robust.
> 
> PAGE_METADATA_NONE was introduced to avoid races with
> mprotect(PROT_MTE).
> For example, migration can race with mprotect(PROT_MTE):
> - thread 0 initiates migration for a page in a non-MTE enabled VMA
> and a
>   destination page is allocated without tag storage.
> - thread 1 handles an mprotect(PROT_MTE), the VMA becomes tagged, and
> an
>   access turns the source page that is in the process of being
> migrated
>   into a tagged page.
> - thread 0 finishes migration and the destination page is mapped as
> tagged,
>   but without tag storage reserved.
> More details and examples can be found in the patches.
> 
> This race is also related to how tag restoring is handled when tag
> storage
> is missing: when a tagged page is swapped out, the tags are saved in
> an
> xarray indexed by swp_entry.val. When a page is swapped back in, if
> there
> are tags corresponding to the swp_entry that the page will replace,
> the
> tags are unconditionally restored, even if the page will be mapped as
> untagged. Because the page will be mapped as untagged, tag storage
> was
> not reserved when the page was allocated to replace the swp_entry
> which has
> tags associated with it.
> 
> To get around this, save the tags in a new xarray, this time indexed
> by
> pfn, and restore them when the same page is mapped as tagged.
> 
> This also solves another race, this time with copy_highpage. In the
> scenario where migration races with mprotect(PROT_MTE), before the
> page is
> mapped, the contents of the source page is copied to the destination.
> And
> this includes tags, which will be copied to a page with missing tag
> storage, which can to data corruption if the missing tag storage is
> in use
> for data. So copy_highpage() has received a similar treatment to the
> swap
> code, and the source tags are copied in the xarray indexed by the
> destination page pfn.
> 
> 
> Overview of the patches
> =======================
> 
> Patches 1-3 do some preparatory work by renaming a few functions and
> a gfp
> flag.
> 
> Patches 4-12 are arch independent and introduce MIGRATE_METADATA to
> the
> page allocator.
> 
> Patches 13-18 are arm64 specific and add support for detecting the
> tag
> storage region and onlining it with the MIGRATE_METADATA migratetype.
> 
> Patches 19-24 are arch independent and modify the page allocator to
> callback into arch dependant functions to reserve metadata storage
> for an
> allocation which requires metadata.
> 
> Patches 25-28 are mostly arm64 specific and implement the reservation
> and
> freeing of tag storage on tagged page allocation. Patch #28 ("mm:
> sched:
> Introduce PF_MEMALLOC_ISOLATE") adds a current flag,
> PF_MEMALLOC_ISOLATE,
> which ignores page isolation limits; this is used by arm64 when
> reserving
> tag storage in the same patch.
> 
> Patches 29-30 add arch independent support for doing
> mprotect(PROT_MTE)
> when metadata storage is enabled.
> 
> Patches 31-37 are mostly arm64 specific and handle the restoring of
> tags
> when tag storage is missing. The exceptions are patches 32 (adds the
> arch_swap_prepare_to_restore() function) and 35 (add
> PAGE_METADATA_NONE
> support for THPs).
> 
> Testing
> =======
> 
> To enable MTE dynamic tag storage:
> 
> - CONFIG_ARM64_MTE_TAG_STORAGE=y
> - system_supports_mte() returns true
> - kasan_hw_tags_enabled() returns false
> - correct DTB node (for the specification, see commit "arm64: mte:
> Reserve tag
>   storage memory")
> 
> Check dmesg for the message "MTE tag storage enabled" or grep for
> metadata
> in /proc/vmstat.
> 
> I've tested the series using FVP with MTE enabled, but without
> support for
> dynamic tag storage reuse. To simulate it, I've added two fake tag
> storage
> regions in the DTB by splitting a 2GB region roughly into 33 slices
> of size
> 0x3e0_0000, and using 32 of them for tagged memory and one slice for
> tag
> storage:
> 
> diff --git a/arch/arm64/boot/dts/arm/fvp-base-revc.dts
> b/arch/arm64/boot/dts/arm/fvp-base-revc.dts
> index 60472d65a355..bd050373d6cf 100644
> --- a/arch/arm64/boot/dts/arm/fvp-base-revc.dts
> +++ b/arch/arm64/boot/dts/arm/fvp-base-revc.dts
> @@ -165,10 +165,28 @@ C1_L2: l2-cache1 {
>                 };
>         };
>  
> -       memory@80000000 {
> +       memory0: memory@80000000 {
>                 device_type = "memory";
> -               reg = <0x00000000 0x80000000 0 0x80000000>,
> -                     <0x00000008 0x80000000 0 0x80000000>;
> +               reg = <0x00 0x80000000 0x00 0x7c000000>;
> +       };
> +
> +       metadata0: metadata@c0000000  {
> +               compatible = "arm,mte-tag-storage";
> +               reg = <0x00 0xfc000000 0x00 0x3e00000>;
> +               block-size = <0x1000>;
> +               memory = <&memory0>;
> +       };
> +
> +       memory1: memory@880000000 {
> +               device_type = "memory";
> +               reg = <0x08 0x80000000 0x00 0x7c000000>;
> +       };
> +
> +       metadata1: metadata@8c0000000  {
> +               compatible = "arm,mte-tag-storage";
> +               reg = <0x08 0xfc000000 0x00 0x3e00000>;
> +               block-size = <0x1000>;
> +               memory = <&memory1>;
>         };
>  

Hi Alexandru,

AFAIK, the above memory configuration means that there are two regions
of DRAM (0x80000000-0xfc000000 and 0x8_80000000-0x8_fc000000) and this
is called the PDD memory map.
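
The sizes in the quoted DTB fragment can be checked with a few lines of C.
This is only a sanity check of the arithmetic (1 byte of tags per 32 bytes
of data, tag region placed directly above its data region); struct
mem_range and the helpers are hypothetical, and the check says nothing
about whether the layout satisfies the hardware constraints.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical description of a "reg" entry: base address and size. */
struct mem_range {
	uint64_t base;
	uint64_t size;
};

static inline uint64_t range_end(struct mem_range r)
{
	return r.base + r.size;
}

/* 4 bits of tag per 16 bytes of data: 1 byte of tags per 32 bytes. */
static inline uint64_t tag_bytes_needed(uint64_t data_size)
{
	return data_size / 32;
}

/* True if @tags starts right after @data and is large enough for it. */
static inline bool tag_range_follows_data(struct mem_range data,
					  struct mem_range tags)
{
	return tags.base == range_end(data) &&
	       tags.size >= tag_bytes_needed(data.size);
}
```

For memory0 (base 0x80000000, size 0x7c000000) this gives an end address
of 0xfc000000 and 0x3e00000 bytes of tags, which matches the metadata0
reg entry. The same holds for memory1 and metadata1 at 0x8_80000000.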

Document [1] lists the following constraints on tag memory:

| The following constraints apply to the tag regions in DRAM:
| 1. The tag region cannot be interleaved with the data region.
| The tag region must also be above the data region within DRAM.
|
| 2. The tag region in the physical address space cannot straddle
| multiple regions of a memory map.
|
| PDD memory map is not allowed to have part of the tag region between
| 2GB-4GB and another part between 34GB-64GB.


I'm not sure if we can separate tag memory with the above
configuration. Or am I missing something?

[1] https://developer.arm.com/documentation/101569/0300/?lang=en
(Section 5.4.6.1)

Thanks,
Kuan-Ying Lee
>         reserved-memory {
> 
> 
> Alexandru Elisei (37):
>   mm: page_alloc: Rename gfp_to_alloc_flags_cma ->
>     gfp_to_alloc_flags_fast
>   arm64: mte: Rework naming for tag manipulation functions
>   arm64: mte: Rename __GFP_ZEROTAGS to __GFP_TAGGED
>   mm: Add MIGRATE_METADATA allocation policy
>   mm: Add memory statistics for the MIGRATE_METADATA allocation
>     policy
>   mm: page_alloc: Allocate from movable pcp lists only if
>     ALLOC_FROM_METADATA
>   mm: page_alloc: Bypass pcp when freeing MIGRATE_METADATA pages
>   mm: compaction: Account for free metadata pages in
>     __compact_finished()
>   mm: compaction: Handle metadata pages as source for direct
>     compaction
>   mm: compaction: Do not use MIGRATE_METADATA to replace pages with
>     metadata
>   mm: migrate/mempolicy: Allocate metadata-enabled destination page
>   mm: gup: Don't allow longterm pinning of MIGRATE_METADATA pages
>   arm64: mte: Reserve tag storage memory
>   arm64: mte: Expose tag storage pages to the MIGRATE_METADATA
>     freelist
>   arm64: mte: Make tag storage depend on ARCH_KEEP_MEMBLOCK
>   arm64: mte: Move tag storage to MIGRATE_MOVABLE when MTE is
>     disabled
>   arm64: mte: Disable dynamic tag storage management if HW KASAN is
>     enabled
>   arm64: mte: Check that tag storage blocks are in the same zone
>   mm: page_alloc: Manage metadata storage on page allocation
>   mm: compaction: Reserve metadata storage in compaction_alloc()
>   mm: khugepaged: Handle metadata-enabled VMAs
>   mm: shmem: Allocate metadata storage for in-memory filesystems
>   mm: Teach vma_alloc_folio() about metadata-enabled VMAs
>   mm: page_alloc: Teach alloc_contig_range() about MIGRATE_METADATA
>   arm64: mte: Manage tag storage on page allocation
>   arm64: mte: Perform CMOs for tag blocks on tagged page
>     allocation/free
>   arm64: mte: Reserve tag block for the zero page
>   mm: sched: Introduce PF_MEMALLOC_ISOLATE
>   mm: arm64: Define the PAGE_METADATA_NONE page protection
>   mm: mprotect: arm64: Set PAGE_METADATA_NONE for mprotect(PROT_MTE)
>   mm: arm64: Set PAGE_METADATA_NONE in set_pte_at() if missing metadata
>     storage
>   mm: Call arch_swap_prepare_to_restore() before arch_swap_restore()
>   arm64: mte: swap/copypage: Handle tag restoring when missing tag
>     storage
>   arm64: mte: Handle fatal signal in reserve_metadata_storage()
>   mm: hugepage: Handle PAGE_METADATA_NONE faults for huge pages
>   KVM: arm64: Disable MTE is tag storage is enabled
>   arm64: mte: Enable tag storage management
> 
>  arch/arm64/Kconfig                       |  13 +
>  arch/arm64/include/asm/assembler.h       |  10 +
>  arch/arm64/include/asm/memory_metadata.h |  49 ++
>  arch/arm64/include/asm/mte-def.h         |  16 +-
>  arch/arm64/include/asm/mte.h             |  40 +-
>  arch/arm64/include/asm/mte_tag_storage.h |  36 ++
>  arch/arm64/include/asm/page.h            |   5 +-
>  arch/arm64/include/asm/pgtable-prot.h    |   2 +
>  arch/arm64/include/asm/pgtable.h         |  33 +-
>  arch/arm64/kernel/Makefile               |   1 +
>  arch/arm64/kernel/elfcore.c              |  14 +-
>  arch/arm64/kernel/hibernate.c            |  46 +-
>  arch/arm64/kernel/mte.c                  |  31 +-
>  arch/arm64/kernel/mte_tag_storage.c      | 667 +++++++++++++++++++++++
>  arch/arm64/kernel/setup.c                |   7 +
>  arch/arm64/kvm/arm.c                     |   6 +-
>  arch/arm64/lib/mte.S                     |  30 +-
>  arch/arm64/mm/copypage.c                 |  26 +
>  arch/arm64/mm/fault.c                    |  35 +-
>  arch/arm64/mm/mteswap.c                  | 113 +++-
>  fs/proc/meminfo.c                        |   8 +
>  fs/proc/page.c                           |   1 +
>  include/asm-generic/Kbuild               |   1 +
>  include/asm-generic/memory_metadata.h    |  50 ++
>  include/linux/gfp.h                      |  10 +
>  include/linux/gfp_types.h                |  14 +-
>  include/linux/huge_mm.h                  |   6 +
>  include/linux/kernel-page-flags.h        |   1 +
>  include/linux/migrate_mode.h             |   1 +
>  include/linux/mm.h                       |  12 +-
>  include/linux/mmzone.h                   |  26 +-
>  include/linux/page-flags.h               |   1 +
>  include/linux/pgtable.h                  |  19 +
>  include/linux/sched.h                    |   2 +-
>  include/linux/sched/mm.h                 |  13 +
>  include/linux/vm_event_item.h            |   5 +
>  include/linux/vmstat.h                   |   2 +
>  include/trace/events/mmflags.h           |   5 +-
>  mm/Kconfig                               |   5 +
>  mm/compaction.c                          |  52 +-
>  mm/huge_memory.c                         | 109 ++++
>  mm/internal.h                            |   7 +
>  mm/khugepaged.c                          |   7 +
>  mm/memory.c                              | 180 +++++-
>  mm/mempolicy.c                           |   7 +
>  mm/migrate.c                             |   6 +
>  mm/mm_init.c                             |  23 +-
>  mm/mprotect.c                            |  46 ++
>  mm/page_alloc.c                          | 136 ++++-
>  mm/page_isolation.c                      |  19 +-
>  mm/page_owner.c                          |   3 +-
>  mm/shmem.c                               |  14 +-
>  mm/show_mem.c                            |   4 +
>  mm/swapfile.c                            |   4 +
>  mm/vmscan.c                              |   3 +
>  mm/vmstat.c                              |  13 +-
>  56 files changed, 1834 insertions(+), 161 deletions(-)
>  create mode 100644 arch/arm64/include/asm/memory_metadata.h
>  create mode 100644 arch/arm64/include/asm/mte_tag_storage.h
>  create mode 100644 arch/arm64/kernel/mte_tag_storage.c
>  create mode 100644 include/asm-generic/memory_metadata.h
>
Catalin Marinas Sept. 13, 2023, 3:29 p.m. UTC | #10
On Mon, Sep 11, 2023 at 02:29:03PM +0200, David Hildenbrand wrote:
> On 11.09.23 13:52, Catalin Marinas wrote:
> > On Wed, Sep 06, 2023 at 12:23:21PM +0100, Alexandru Elisei wrote:
> > > On Thu, Aug 24, 2023 at 04:24:30PM +0100, Catalin Marinas wrote:
> > > > On Thu, Aug 24, 2023 at 01:25:41PM +0200, David Hildenbrand wrote:
> > > > > On 24.08.23 13:06, David Hildenbrand wrote:
> > > > > > Regarding one complication: "The kernel needs to know where to allocate
> > > > > > a PROT_MTE page from or migrate a current page if it becomes PROT_MTE
> > > > > > (mprotect()) and the range it is in does not support tagging.",
> > > > > > simplified handling would be if it's in a MIGRATE_CMA pageblock, it
> > > > > > doesn't support tagging. You have to migrate to a !CMA page (for
> > > > > > example, not specifying GFP_MOVABLE as a quick way to achieve that).
> > > > > 
> > > > > Okay, I now realize that this patch set effectively duplicates some CMA
> > > > > behavior using a new migrate-type.
> > [...]
> > > I considered mixing the tag storage memory memory with normal memory and
> > > adding it to MIGRATE_CMA. But since tag storage memory cannot be tagged,
> > > this means that it's not enough anymore to have a __GFP_MOVABLE allocation
> > > request to use MIGRATE_CMA.
> > > 
> > > I considered two solutions to this problem:
> > > 
> > > 1. Only allocate from MIGRATE_CMA is the requested memory is not tagged =>
> > > this effectively means transforming all memory from MIGRATE_CMA into the
> > > MIGRATE_METADATA migratetype that the series introduces. Not very
> > > appealing, because that means treating normal memory that is also on the
> > > MIGRATE_CMA lists as tagged memory.
> > 
> > That's indeed not ideal. We could try this if it makes the patches
> > significantly simpler, though I'm not so sure.
> > 
> > Allocating metadata is the easier part as we know the correspondence
> > from the tagged pages (32 PROT_MTE page) to the metadata page (1 tag
> > storage page), so alloc_contig_range() does this for us. Just adding it
> > to the CMA range is sufficient.
> > 
> > However, making sure that we don't allocate PROT_MTE pages from the
> > metadata range is what led us to another migrate type. I guess we could
> > achieve something similar with a new zone or a CPU-less NUMA node,
> 
> Ideally, no significant core-mm changes to optimize for an architecture
> oddity. That implies, no new zones and no new migratetypes -- unless it is
> unavoidable and you are confident that you can convince core-MM people that
> the use case (giving back 3% of system RAM at max in some setups) is worth
> the trouble.

If I was an mm maintainer, I'd also question this ;). But vendors seem
pretty picky about the amount of RAM reserved for MTE (e.g. 0.5G for a
16G platform does look somewhat big). As more and more apps adopt MTE,
the wastage would be smaller but the first step is getting vendors to
enable it.

> I also had CPU-less NUMA nodes in mind when thinking about that, but not
> sure how easy it would be to integrate it. If the tag memory has actually
> different performance characteristics as well, a NUMA node would be the
> right choice.

In general I'd expect the same characteristics. However, changing the
memory designation from tag to data (and vice-versa) requires some cache
maintenance. The allocation cost is slightly higher (not the runtime
one), so it would help if the page allocator does not favour this range.
Anyway, that's an optimisation to worry about later.

> If we could find some way to easily support this either via CMA or CPU-less
> NUMA nodes, that would be much preferable; even if we cannot cover each and
> every future use case right now. I expect some issues with CXL+MTE either
> way , but are happy to be taught otherwise :)

I think CXL+MTE is rather theoretical at the moment. Given that PCIe
doesn't have any notion of MTE, more likely there would be some piece of
interconnect that generates two memory accesses: one for data and the
other for tags at a configurable offset (which may or may not be in the
same CXL range).

> Another thought I had was adding something like CMA memory characteristics.
> Like, asking if a given CMA area/page supports tagging (i.e., flag for the
> CMA area set?)?

I don't think adding CMA memory characteristics helps much. The metadata
allocation wouldn't go through cma_alloc() but rather
alloc_contig_range() directly for a specific pfn corresponding to the
data pages with PROT_MTE. The core mm code doesn't need to know about
the tag storage layout.
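
To illustrate the "specific pfn" part, here is a minimal sketch of how
arch code might find the tag storage page for a given data page. It
assumes a plain linear mapping with no interleaving and a tag block the
size of one page; the struct, macro and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define TAGGED_PAGES_PER_TAG_PAGE 32ULL	/* 32 PROT_MTE pages per tag storage page */

/* Hypothetical layout: a data range and the tag storage range that covers it. */
struct tag_layout {
	uint64_t data_start_pfn;
	uint64_t data_nr_pages;
	uint64_t tag_start_pfn;
};

/*
 * Return the pfn of the tag storage page holding the tags for data_pfn,
 * assuming a simple linear mapping with no interleaving.
 */
static uint64_t data_pfn_to_tag_pfn(const struct tag_layout *l, uint64_t data_pfn)
{
	assert(data_pfn >= l->data_start_pfn);
	assert(data_pfn < l->data_start_pfn + l->data_nr_pages);

	return l->tag_start_pfn +
	       (data_pfn - l->data_start_pfn) / TAGGED_PAGES_PER_TAG_PAGE;
}
```

The resulting pfn is what would be handed to alloc_contig_range(), so
only the arch code needs to carry this mapping.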

It's also unlikely for cma_alloc() memory to be mapped as PROT_MTE.
That's typically coming from device drivers (DMA API) with their own
mmap() implementation that doesn't normally set VM_MTE_ALLOWED (and
therefore PROT_MTE is rejected).

What we need though is to prevent vma_alloc_folio() from allocating from
a MIGRATE_CMA list if PROT_MTE (VM_MTE). I guess that's basically
removing __GFP_MOVABLE in those cases. As long as we don't have large
ZONE_MOVABLE areas, it shouldn't be an issue.

> When you need memory that supports tagging and have a page that does not
> support tagging (CMA && taggable), simply migrate to !MOVABLE memory
> (eventually we could also try adding !CMA).
> 
> Was that discussed and what would be the challenges with that? Page
> migration due to compaction comes to mind, but it might also be easy to
> handle if we can just avoid CMA memory for that.

IIRC that was because PROT_MTE pages would have to come only from
!MOVABLE ranges. Maybe that's not such a big deal.

We'll give this a go and hopefully it simplifies the patches a bit (it
will take a while as Alex keeps going on holiday ;)). In the meantime,
I'm talking to the hardware people to see whether we can have MTE pages
in the tag storage/metadata range. We'd still need to reserve about 0.1%
of the RAM for the metadata corresponding to the tag storage range when
used as data but that's negligible (1/32 of 1/32). So if some future
hardware allows this, we can drop the page allocation restriction from
the CMA range.
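
For reference, the numbers work out as in the sketch below, assuming 1
byte of tag storage per 32 bytes of data. The helper names are made up
for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define MTE_TAG_RATIO 32ULL	/* 1 byte of tag storage per 32 bytes of data */

/* Tag storage needed to cover ram_bytes of taggable memory. */
static uint64_t tag_storage_bytes(uint64_t ram_bytes)
{
	return ram_bytes / MTE_TAG_RATIO;
}

/* Tag storage needed to tag the tag storage range itself (1/32 of 1/32). */
static uint64_t tag_storage_for_tag_storage(uint64_t ram_bytes)
{
	return tag_storage_bytes(tag_storage_bytes(ram_bytes));
}
```

For a 16G platform this gives 512M of tag storage, and 16M (1/1024 of
RAM, just under 0.1%) that would have to stay reserved to tag the tag
storage range itself.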

> > though the latter is not guaranteed not to allocate memory from the
> > range, only make it less likely. Both these options are less flexible in
> > terms of size/alignment/placement.
> > 
> > Maybe as a quick hack - only allow PROT_MTE from ZONE_NORMAL and
> > configure the metadata range in ZONE_MOVABLE but at some point I'd
> > expect some CXL-attached memory to support MTE with additional carveout
> > reserved.
> 
> I have no idea how we could possibly cleanly support memory hotplug in
> virtual environments (virtual DIMMs, virtio-mem) with MTE. In contrast to
> s390x storage keys, the approach that arm64 with MTE took here (exposing tag
> memory to the VM) makes it rather hard and complicated.

The current thinking is that the VM is not aware of the tag storage,
that's entirely managed by the host. The host would treat the guest
memory similarly to the PROT_MTE user allocations, reserve metadata etc.

Thanks for the feedback so far, very useful.
Catalin Marinas Sept. 14, 2023, 5:37 p.m. UTC | #11
Hi Kuan-Ying,

On Wed, Sep 13, 2023 at 08:11:40AM +0000, Kuan-Ying Lee (李冠穎) wrote:
> On Wed, 2023-08-23 at 14:13 +0100, Alexandru Elisei wrote:
> > diff --git a/arch/arm64/boot/dts/arm/fvp-base-revc.dts
> > b/arch/arm64/boot/dts/arm/fvp-base-revc.dts
> > index 60472d65a355..bd050373d6cf 100644
> > --- a/arch/arm64/boot/dts/arm/fvp-base-revc.dts
> > +++ b/arch/arm64/boot/dts/arm/fvp-base-revc.dts
> > @@ -165,10 +165,28 @@ C1_L2: l2-cache1 {
> >                 };
> >         };
> >  
> > -       memory@80000000 {
> > +       memory0: memory@80000000 {
> >                 device_type = "memory";
> > -               reg = <0x00000000 0x80000000 0 0x80000000>,
> > -                     <0x00000008 0x80000000 0 0x80000000>;
> > +               reg = <0x00 0x80000000 0x00 0x7c000000>;
> > +       };
> > +
> > +       metadata0: metadata@c0000000  {
> > +               compatible = "arm,mte-tag-storage";
> > +               reg = <0x00 0xfc000000 0x00 0x3e00000>;
> > +               block-size = <0x1000>;
> > +               memory = <&memory0>;
> > +       };
> > +
> > +       memory1: memory@880000000 {
> > +               device_type = "memory";
> > +               reg = <0x08 0x80000000 0x00 0x7c000000>;
> > +       };
> > +
> > +       metadata1: metadata@8c0000000  {
> > +               compatible = "arm,mte-tag-storage";
> > +               reg = <0x08 0xfc000000 0x00 0x3e00000>;
> > +               block-size = <0x1000>;
> > +               memory = <&memory1>;
> >         };
> >  
> 
> AFAIK, the above memory configuration means that there are two region
> of dram(0x80000000-0xfc000000 and 0x8_80000000-0x8_fc0000000) and this
> is called PDD memory map.
> 
> Document[1] said there are some constraints of tag memory as below.
> 
> | The following constraints apply to the tag regions in DRAM:
> | 1. The tag region cannot be interleaved with the data region.
> | The tag region must also be above the data region within DRAM.
> |
> | 2.The tag region in the physical address space cannot straddle
> | multiple regions of a memory map.
> |
> | PDD memory map is not allowed to have part of the tag region between
> | 2GB-4GB and another part between 34GB-64GB.
> 
> I'm not sure if we can separate tag memory with the above
> configuration. Or do I miss something?
> 
> [1] https://developer.arm.com/documentation/101569/0300/?lang=en
> (Section 5.4.6.1)

Good point, thanks. The above dts is just some random layout we picked as
an example; it doesn't match any real hardware and we didn't pay attention
to the interconnect limitations (we fake the tag storage on the model).

I'll try to dig out how the mtu_tag_addr_shutter registers work and how
the sparse DRAM space is compressed to a smaller tag range. But that's
something done by firmware and the kernel only learns the tag storage
location from the DT (provided by firmware). We also don't need to know
the fine-grained mapping between 32 bytes of data and 1 byte (2 tags) in
the tag storage, only the block size in the tag storage space that
covers all interleaving done by the interconnect (it can be from 1 byte
to something larger like a page; the kernel will then use the lowest
common multiple between a page size and this tag block size to figure
out how many pages to reserve).
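
A rough sketch of that last calculation, assuming the block size comes
from the "block-size" DT property and is given in bytes. The function
names are hypothetical and this is not the code from the series.

```c
#include <assert.h>
#include <stdint.h>

/* Greatest common divisor (Euclid's algorithm). */
static uint64_t gcd_u64(uint64_t a, uint64_t b)
{
	while (b) {
		uint64_t t = a % b;

		a = b;
		b = t;
	}
	return a;
}

/*
 * Number of tag storage pages that have to be reserved together so that
 * the reservation covers a whole number of tag blocks:
 * lcm(page_size, block_size) / page_size.
 */
static uint64_t pages_per_tag_block(uint64_t page_size, uint64_t block_size)
{
	uint64_t lcm;

	assert(page_size && block_size);
	lcm = page_size / gcd_u64(page_size, block_size) * block_size;

	return lcm / page_size;
}
```

With 4K pages, a block size of 0x1000 (as in the example dts) or of 1
byte gives one page at a time, while a block size of 8K gives two.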
Hyesoo Yu Oct. 25, 2023, 2:59 a.m. UTC | #12
On Wed, Sep 13, 2023 at 04:29:25PM +0100, Catalin Marinas wrote:
> On Mon, Sep 11, 2023 at 02:29:03PM +0200, David Hildenbrand wrote:
> > On 11.09.23 13:52, Catalin Marinas wrote:
> > > On Wed, Sep 06, 2023 at 12:23:21PM +0100, Alexandru Elisei wrote:
> > > > On Thu, Aug 24, 2023 at 04:24:30PM +0100, Catalin Marinas wrote:
> > > > > On Thu, Aug 24, 2023 at 01:25:41PM +0200, David Hildenbrand wrote:
> > > > > > On 24.08.23 13:06, David Hildenbrand wrote:
> > > > > > > Regarding one complication: "The kernel needs to know where to allocate
> > > > > > > a PROT_MTE page from or migrate a current page if it becomes PROT_MTE
> > > > > > > (mprotect()) and the range it is in does not support tagging.",
> > > > > > > simplified handling would be if it's in a MIGRATE_CMA pageblock, it
> > > > > > > doesn't support tagging. You have to migrate to a !CMA page (for
> > > > > > > example, not specifying GFP_MOVABLE as a quick way to achieve that).
> > > > > > 
> > > > > > Okay, I now realize that this patch set effectively duplicates some CMA
> > > > > > behavior using a new migrate-type.
> > > [...]
> > > > I considered mixing the tag storage memory memory with normal memory and
> > > > adding it to MIGRATE_CMA. But since tag storage memory cannot be tagged,
> > > > this means that it's not enough anymore to have a __GFP_MOVABLE allocation
> > > > request to use MIGRATE_CMA.
> > > > 
> > > > I considered two solutions to this problem:
> > > > 
> > > > 1. Only allocate from MIGRATE_CMA is the requested memory is not tagged =>
> > > > this effectively means transforming all memory from MIGRATE_CMA into the
> > > > MIGRATE_METADATA migratetype that the series introduces. Not very
> > > > appealing, because that means treating normal memory that is also on the
> > > > MIGRATE_CMA lists as tagged memory.
> > > 
> > > That's indeed not ideal. We could try this if it makes the patches
> > > significantly simpler, though I'm not so sure.
> > > 
> > > Allocating metadata is the easier part as we know the correspondence
> > > from the tagged pages (32 PROT_MTE page) to the metadata page (1 tag
> > > storage page), so alloc_contig_range() does this for us. Just adding it
> > > to the CMA range is sufficient.
> > > 
> > > However, making sure that we don't allocate PROT_MTE pages from the
> > > metadata range is what led us to another migrate type. I guess we could
> > > achieve something similar with a new zone or a CPU-less NUMA node,
> > 
> > Ideally, no significant core-mm changes to optimize for an architecture
> > oddity. That implies, no new zones and no new migratetypes -- unless it is
> > unavoidable and you are confident that you can convince core-MM people that
> > the use case (giving back 3% of system RAM at max in some setups) is worth
> > the trouble.
> 
> If I was an mm maintainer, I'd also question this ;). But vendors seem
> pretty picky about the amount of RAM reserved for MTE (e.g. 0.5G for a
> 16G platform does look somewhat big). As more and more apps adopt MTE,
> the wastage would be smaller but the first step is getting vendors to
> enable it.
> 
> > I also had CPU-less NUMA nodes in mind when thinking about that, but not
> > sure how easy it would be to integrate it. If the tag memory has actually
> > different performance characteristics as well, a NUMA node would be the
> > right choice.
> 
> In general I'd expect the same characteristics. However, changing the
> memory designation from tag to data (and vice-versa) requires some cache
> maintenance. The allocation cost is slightly higher (not the runtime
> one), so it would help if the page allocator does not favour this range.
> Anyway, that's an optimisation to worry about later.
> 
> > If we could find some way to easily support this either via CMA or CPU-less
> > NUMA nodes, that would be much preferable; even if we cannot cover each and
> > every future use case right now. I expect some issues with CXL+MTE either
> > way , but are happy to be taught otherwise :)
> 
> I think CXL+MTE is rather theoretical at the moment. Given that PCIe
> doesn't have any notion of MTE, more likely there would be some piece of
> interconnect that generates two memory accesses: one for data and the
> other for tags at a configurable offset (which may or may not be in the
> same CXL range).
> 
> > Another thought I had was adding something like CMA memory characteristics.
> > Like, asking if a given CMA area/page supports tagging (i.e., flag for the
> > CMA area set?)?
> 
> I don't think adding CMA memory characteristics helps much. The metadata
> allocation wouldn't go through cma_alloc() but rather
> alloc_contig_range() directly for a specific pfn corresponding to the
> data pages with PROT_MTE. The core mm code doesn't need to know about
> the tag storage layout.
> 
> It's also unlikely for cma_alloc() memory to be mapped as PROT_MTE.
> That's typically coming from device drivers (DMA API) with their own
> mmap() implementation that doesn't normally set VM_MTE_ALLOWED (and
> therefore PROT_MTE is rejected).
> 
> What we need though is to prevent vma_alloc_folio() from allocating from
> a MIGRATE_CMA list if PROT_MTE (VM_MTE). I guess that's basically
> removing __GFP_MOVABLE in those cases. As long as we don't have large
> ZONE_MOVABLE areas, it shouldn't be an issue.
> 

How about unsetting ALLOC_CMA if __GFP_TAGGED is set?
Removing __GFP_MOVABLE may cause movable pages to be allocated in an
unmovable migratetype, which may not be desirable for page fragmentation.

> > When you need memory that supports tagging and have a page that does not
> > support tagging (CMA && taggable), simply migrate to !MOVABLE memory
> > (eventually we could also try adding !CMA).
> > 
> > Was that discussed and what would be the challenges with that? Page
> > migration due to compaction comes to mind, but it might also be easy to
> > handle if we can just avoid CMA memory for that.
> 
> IIRC that was because PROT_MTE pages would have to come only from
> !MOVABLE ranges. Maybe that's not such big deal.
> 

Could you explain what it means that PROT_MTE pages have to come only
from !MOVABLE ranges? I don't understand this part very well.

Thanks,
Hyesoo.

> We'll give this a go and hopefully it simplifies the patches a bit (it
> will take a while as Alex keeps going on holiday ;)). In the meantime,
> I'm talking to the hardware people to see whether we can have MTE pages
> in the tag storage/metadata range. We'd still need to reserve about 0.1%
> of the RAM for the metadata corresponding to the tag storage range when
> used as data but that's negligible (1/32 of 1/32). So if some future
> hardware allows this, we can drop the page allocation restriction from
> the CMA range.
> 
> > > though the latter is not guaranteed not to allocate memory from the
> > > range, only make it less likely. Both these options are less flexible in
> > > terms of size/alignment/placement.
> > > 
> > > Maybe as a quick hack - only allow PROT_MTE from ZONE_NORMAL and
> > > configure the metadata range in ZONE_MOVABLE but at some point I'd
> > > expect some CXL-attached memory to support MTE with additional carveout
> > > reserved.
> > 
> > I have no idea how we could possibly cleanly support memory hotplug in
> > virtual environments (virtual DIMMs, virtio-mem) with MTE. In contrast to
> > s390x storage keys, the approach that arm64 with MTE took here (exposing tag
> > memory to the VM) makes it rather hard and complicated.
> 
> The current thinking is that the VM is not aware of the tag storage,
> that's entirely managed by the host. The host would treat the guest
> memory similarly to the PROT_MTE user allocations, reserve metadata etc.
> 
> Thanks for the feedback so far, very useful.
> 
> -- 
> Catalin
>
Alexandru Elisei Oct. 25, 2023, 8:47 a.m. UTC | #13
Hi,

On Wed, Oct 25, 2023 at 11:59:32AM +0900, Hyesoo Yu wrote:
> On Wed, Sep 13, 2023 at 04:29:25PM +0100, Catalin Marinas wrote:
> > On Mon, Sep 11, 2023 at 02:29:03PM +0200, David Hildenbrand wrote:
> > > On 11.09.23 13:52, Catalin Marinas wrote:
> > > > On Wed, Sep 06, 2023 at 12:23:21PM +0100, Alexandru Elisei wrote:
> > > > > On Thu, Aug 24, 2023 at 04:24:30PM +0100, Catalin Marinas wrote:
> > > > > > On Thu, Aug 24, 2023 at 01:25:41PM +0200, David Hildenbrand wrote:
> > > > > > > On 24.08.23 13:06, David Hildenbrand wrote:
> > > > > > > > Regarding one complication: "The kernel needs to know where to allocate
> > > > > > > > a PROT_MTE page from or migrate a current page if it becomes PROT_MTE
> > > > > > > > (mprotect()) and the range it is in does not support tagging.",
> > > > > > > > simplified handling would be if it's in a MIGRATE_CMA pageblock, it
> > > > > > > > doesn't support tagging. You have to migrate to a !CMA page (for
> > > > > > > > example, not specifying GFP_MOVABLE as a quick way to achieve that).
> > > > > > > 
> > > > > > > Okay, I now realize that this patch set effectively duplicates some CMA
> > > > > > > behavior using a new migrate-type.
> > > > [...]
> > > > > I considered mixing the tag storage memory memory with normal memory and
> > > > > adding it to MIGRATE_CMA. But since tag storage memory cannot be tagged,
> > > > > this means that it's not enough anymore to have a __GFP_MOVABLE allocation
> > > > > request to use MIGRATE_CMA.
> > > > > 
> > > > > I considered two solutions to this problem:
> > > > > 
> > > > > 1. Only allocate from MIGRATE_CMA is the requested memory is not tagged =>
> > > > > this effectively means transforming all memory from MIGRATE_CMA into the
> > > > > MIGRATE_METADATA migratetype that the series introduces. Not very
> > > > > appealing, because that means treating normal memory that is also on the
> > > > > MIGRATE_CMA lists as tagged memory.
> > > > 
> > > > That's indeed not ideal. We could try this if it makes the patches
> > > > significantly simpler, though I'm not so sure.
> > > > 
> > > > Allocating metadata is the easier part as we know the correspondence
> > > > from the tagged pages (32 PROT_MTE page) to the metadata page (1 tag
> > > > storage page), so alloc_contig_range() does this for us. Just adding it
> > > > to the CMA range is sufficient.
> > > > 
> > > > However, making sure that we don't allocate PROT_MTE pages from the
> > > > metadata range is what led us to another migrate type. I guess we could
> > > > achieve something similar with a new zone or a CPU-less NUMA node,
> > > 
> > > Ideally, no significant core-mm changes to optimize for an architecture
> > > oddity. That implies, no new zones and no new migratetypes -- unless it is
> > > unavoidable and you are confident that you can convince core-MM people that
> > > the use case (giving back 3% of system RAM at max in some setups) is worth
> > > the trouble.
> > 
> > If I was an mm maintainer, I'd also question this ;). But vendors seem
> > pretty picky about the amount of RAM reserved for MTE (e.g. 0.5G for a
> > 16G platform does look somewhat big). As more and more apps adopt MTE,
> > the wastage would be smaller but the first step is getting vendors to
> > enable it.
> > 
> > > I also had CPU-less NUMA nodes in mind when thinking about that, but not
> > > sure how easy it would be to integrate it. If the tag memory has actually
> > > different performance characteristics as well, a NUMA node would be the
> > > right choice.
> > 
> > In general I'd expect the same characteristics. However, changing the
> > memory designation from tag to data (and vice-versa) requires some cache
> > maintenance. The allocation cost is slightly higher (not the runtime
> > one), so it would help if the page allocator does not favour this range.
> > Anyway, that's an optimisation to worry about later.
> > 
> > > If we could find some way to easily support this either via CMA or CPU-less
> > > NUMA nodes, that would be much preferable; even if we cannot cover each and
> > > every future use case right now. I expect some issues with CXL+MTE either
> > > way , but are happy to be taught otherwise :)
> > 
> > I think CXL+MTE is rather theoretical at the moment. Given that PCIe
> > doesn't have any notion of MTE, more likely there would be some piece of
> > interconnect that generates two memory accesses: one for data and the
> > other for tags at a configurable offset (which may or may not be in the
> > same CXL range).
> > 
> > > Another thought I had was adding something like CMA memory characteristics.
> > > Like, asking if a given CMA area/page supports tagging (i.e., flag for the
> > > CMA area set?)?
> > 
> > I don't think adding CMA memory characteristics helps much. The metadata
> > allocation wouldn't go through cma_alloc() but rather
> > alloc_contig_range() directly for a specific pfn corresponding to the
> > data pages with PROT_MTE. The core mm code doesn't need to know about
> > the tag storage layout.
> > 
> > It's also unlikely for cma_alloc() memory to be mapped as PROT_MTE.
> > That's typically coming from device drivers (DMA API) with their own
> > mmap() implementation that doesn't normally set VM_MTE_ALLOWED (and
> > therefore PROT_MTE is rejected).
> > 
> > What we need though is to prevent vma_alloc_folio() from allocating from
> > a MIGRATE_CMA list if PROT_MTE (VM_MTE). I guess that's basically
> > removing __GFP_MOVABLE in those cases. As long as we don't have large
> > ZONE_MOVABLE areas, it shouldn't be an issue.
> > 
> 
> How about unsetting ALLOC_CMA if GFP_TAGGED ?
> Removing __GFP_MOVABLE may cause movable pages to be allocated in un
> unmovable migratetype, which may not be desirable for page fragmentation.

Yes, not setting ALLOC_CMA in alloc_flags if __GFP_TAGGED is what I am
intending to do.

> 
> > > When you need memory that supports tagging and have a page that does not
> > > support tagging (CMA && taggable), simply migrate to !MOVABLE memory
> > > (eventually we could also try adding !CMA).
> > > 
> > > Was that discussed and what would be the challenges with that? Page
> > > migration due to compaction comes to mind, but it might also be easy to
> > > handle if we can just avoid CMA memory for that.
> > 
> > IIRC that was because PROT_MTE pages would have to come only from
> > !MOVABLE ranges. Maybe that's not such big deal.
> > 
> 
> Could you explain what it means that PROT_MTE have to come only from
> !MOVABLE range ? I don't understand this part very well.

I believe that was with the old approach, where tag storage cannot be tagged.

I'm guessing that the idea was that during migration of a tagged page, to make
sure that the destination page is not a tag storage page (which cannot be
tagged), the gfp flags used for allocating the destination page would be set
without __GFP_MOVABLE, which ensures that the destination page is not
allocated from MIGRATE_CMA. But that is not needed anymore if we don't set
ALLOC_CMA when __GFP_TAGGED is requested.
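
Roughly, the decision would look like the userspace model below. The
flag values and the function name are invented for the example and do
not match the kernel's definitions; only the logic is the point.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag values for a userspace model; the kernel's real values differ. */
#define MODEL_GFP_MOVABLE	0x1u
#define MODEL_GFP_TAGGED	0x2u
#define MODEL_ALLOC_CMA		0x1u

/*
 * Model of the decision described above: movable requests may use
 * MIGRATE_CMA pageblocks unless they need tagged memory.
 */
static unsigned int model_alloc_flags_cma(unsigned int gfp_mask,
					  unsigned int alloc_flags)
{
	bool movable = gfp_mask & MODEL_GFP_MOVABLE;
	bool tagged = gfp_mask & MODEL_GFP_TAGGED;

	if (movable && !tagged)
		alloc_flags |= MODEL_ALLOC_CMA;

	return alloc_flags;
}
```

A tagged request keeps its movable bit, so the page still lands in a
movable pageblock; it just never gets ALLOC_CMA and therefore never
comes from the tag storage range.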

Thanks,
Alex

> 
> Thanks,
> Hyesoo.
> 
> > We'll give this a go and hopefully it simplifies the patches a bit (it
> > will take a while as Alex keeps going on holiday ;)). In the meantime,
> > I'm talking to the hardware people to see whether we can have MTE pages
> > in the tag storage/metadata range. We'd still need to reserve about 0.1%
> > of the RAM for the metadata corresponding to the tag storage range when
> > used as data but that's negligible (1/32 of 1/32). So if some future
> > hardware allows this, we can drop the page allocation restriction from
> > the CMA range.
> > 
> > > > though the latter is not guaranteed not to allocate memory from the
> > > > range, only make it less likely. Both these options are less flexible in
> > > > terms of size/alignment/placement.
> > > > 
> > > > Maybe as a quick hack - only allow PROT_MTE from ZONE_NORMAL and
> > > > configure the metadata range in ZONE_MOVABLE but at some point I'd
> > > > expect some CXL-attached memory to support MTE with additional carveout
> > > > reserved.
> > > 
> > > I have no idea how we could possibly cleanly support memory hotplug in
> > > virtual environments (virtual DIMMs, virtio-mem) with MTE. In contrast to
> > > s390x storage keys, the approach that arm64 with MTE took here (exposing tag
> > > memory to the VM) makes it rather hard and complicated.
> > 
> > The current thinking is that the VM is not aware of the tag storage,
> > that's entirely managed by the host. The host would treat the guest
> > memory similarly to the PROT_MTE user allocations, reserve metadata etc.
> > 
> > Thanks for the feedback so far, very useful.
> > 
> > -- 
> > Catalin
> >
Hyesoo Yu Oct. 25, 2023, 8:52 a.m. UTC | #14
On Wed, Oct 25, 2023 at 09:47:36AM +0100, Alexandru Elisei wrote:
> Hi,
> 
> On Wed, Oct 25, 2023 at 11:59:32AM +0900, Hyesoo Yu wrote:
> > On Wed, Sep 13, 2023 at 04:29:25PM +0100, Catalin Marinas wrote:
> > > On Mon, Sep 11, 2023 at 02:29:03PM +0200, David Hildenbrand wrote:
> > > > On 11.09.23 13:52, Catalin Marinas wrote:
> > > > > On Wed, Sep 06, 2023 at 12:23:21PM +0100, Alexandru Elisei wrote:
> > > > > > On Thu, Aug 24, 2023 at 04:24:30PM +0100, Catalin Marinas wrote:
> > > > > > > On Thu, Aug 24, 2023 at 01:25:41PM +0200, David Hildenbrand wrote:
> > > > > > > > On 24.08.23 13:06, David Hildenbrand wrote:
> > > > > > > > > Regarding one complication: "The kernel needs to know where to allocate
> > > > > > > > > a PROT_MTE page from or migrate a current page if it becomes PROT_MTE
> > > > > > > > > (mprotect()) and the range it is in does not support tagging.",
> > > > > > > > > simplified handling would be if it's in a MIGRATE_CMA pageblock, it
> > > > > > > > > doesn't support tagging. You have to migrate to a !CMA page (for
> > > > > > > > > example, not specifying GFP_MOVABLE as a quick way to achieve that).
> > > > > > > > 
> > > > > > > > Okay, I now realize that this patch set effectively duplicates some CMA
> > > > > > > > behavior using a new migrate-type.
> > > > > [...]
> > > > > > > I considered mixing the tag storage memory with normal memory and
> > > > > > adding it to MIGRATE_CMA. But since tag storage memory cannot be tagged,
> > > > > > this means that it's not enough anymore to have a __GFP_MOVABLE allocation
> > > > > > request to use MIGRATE_CMA.
> > > > > > 
> > > > > > I considered two solutions to this problem:
> > > > > > 
> > > > > > > 1. Only allocate from MIGRATE_CMA if the requested memory is not tagged =>
> > > > > > this effectively means transforming all memory from MIGRATE_CMA into the
> > > > > > MIGRATE_METADATA migratetype that the series introduces. Not very
> > > > > > appealing, because that means treating normal memory that is also on the
> > > > > > MIGRATE_CMA lists as tagged memory.
> > > > > 
> > > > > That's indeed not ideal. We could try this if it makes the patches
> > > > > significantly simpler, though I'm not so sure.
> > > > > 
> > > > > Allocating metadata is the easier part as we know the correspondence
> > > > > > from the tagged pages (32 PROT_MTE pages) to the metadata page (1 tag
> > > > > storage page), so alloc_contig_range() does this for us. Just adding it
> > > > > to the CMA range is sufficient.
> > > > > 
> > > > > However, making sure that we don't allocate PROT_MTE pages from the
> > > > > metadata range is what led us to another migrate type. I guess we could
> > > > > achieve something similar with a new zone or a CPU-less NUMA node,
> > > > 
> > > > Ideally, no significant core-mm changes to optimize for an architecture
> > > > oddity. That implies, no new zones and no new migratetypes -- unless it is
> > > > unavoidable and you are confident that you can convince core-MM people that
> > > > the use case (giving back 3% of system RAM at max in some setups) is worth
> > > > the trouble.
> > > 
> > > If I was an mm maintainer, I'd also question this ;). But vendors seem
> > > pretty picky about the amount of RAM reserved for MTE (e.g. 0.5G for a
> > > 16G platform does look somewhat big). As more and more apps adopt MTE,
> > > the wastage would be smaller but the first step is getting vendors to
> > > enable it.
> > > 
> > > > I also had CPU-less NUMA nodes in mind when thinking about that, but not
> > > > sure how easy it would be to integrate it. If the tag memory has actually
> > > > different performance characteristics as well, a NUMA node would be the
> > > > right choice.
> > > 
> > > In general I'd expect the same characteristics. However, changing the
> > > memory designation from tag to data (and vice-versa) requires some cache
> > > maintenance. The allocation cost is slightly higher (not the runtime
> > > one), so it would help if the page allocator does not favour this range.
> > > Anyway, that's an optimisation to worry about later.
> > > 
> > > > If we could find some way to easily support this either via CMA or CPU-less
> > > > NUMA nodes, that would be much preferable; even if we cannot cover each and
> > > > every future use case right now. I expect some issues with CXL+MTE either
> > > > way, but am happy to be taught otherwise :)
> > > 
> > > I think CXL+MTE is rather theoretical at the moment. Given that PCIe
> > > doesn't have any notion of MTE, more likely there would be some piece of
> > > interconnect that generates two memory accesses: one for data and the
> > > other for tags at a configurable offset (which may or may not be in the
> > > same CXL range).
> > > 
> > > > Another thought I had was adding something like CMA memory characteristics.
> > > > Like, asking if a given CMA area/page supports tagging (i.e., flag for the
> > > > CMA area set?)?
> > > 
> > > I don't think adding CMA memory characteristics helps much. The metadata
> > > allocation wouldn't go through cma_alloc() but rather
> > > alloc_contig_range() directly for a specific pfn corresponding to the
> > > data pages with PROT_MTE. The core mm code doesn't need to know about
> > > the tag storage layout.
> > > 
> > > It's also unlikely for cma_alloc() memory to be mapped as PROT_MTE.
> > > That's typically coming from device drivers (DMA API) with their own
> > > mmap() implementation that doesn't normally set VM_MTE_ALLOWED (and
> > > therefore PROT_MTE is rejected).
> > > 
> > > What we need though is to prevent vma_alloc_folio() from allocating from
> > > a MIGRATE_CMA list if PROT_MTE (VM_MTE). I guess that's basically
> > > removing __GFP_MOVABLE in those cases. As long as we don't have large
> > > ZONE_MOVABLE areas, it shouldn't be an issue.
> > > 
> > 
> > How about unsetting ALLOC_CMA if __GFP_TAGGED?
> > Removing __GFP_MOVABLE may cause movable pages to be allocated in an
> > unmovable migratetype, which may not be desirable for page fragmentation.
> 
> Yes, not setting ALLOC_CMA in alloc_flags if __GFP_TAGGED is what I am
> intending to do.
> 
> > 
> > > > When you need memory that supports tagging and have a page that does not
> > > > support tagging (CMA && taggable), simply migrate to !MOVABLE memory
> > > > (eventually we could also try adding !CMA).
> > > > 
> > > > Was that discussed and what would be the challenges with that? Page
> > > > migration due to compaction comes to mind, but it might also be easy to
> > > > handle if we can just avoid CMA memory for that.
> > > 
> > > IIRC that was because PROT_MTE pages would have to come only from
> > > !MOVABLE ranges. Maybe that's not such a big deal.
> > > 
> > 
> > Could you explain what it means that PROT_MTE pages have to come only from
> > !MOVABLE ranges? I don't understand this part very well.
> 
> I believe that was with the old approach, where tag storage cannot be tagged.
> 
> I'm guessing that the idea was that during migration of a tagged page, to make
> sure that the destination page is not a tag storage page (which cannot be
> tagged), the gfp flags used for allocating the destination page would be set
> without __GFP_MOVABLE, which ensures that the destination page is not
> allocated from MIGRATE_CMA. But that is not needed anymore if we don't set
> ALLOC_CMA when __GFP_TAGGED is set.
> 
> Thanks,
> Alex
> 

Hello, Alex.

If we only avoid using ALLOC_CMA for __GFP_TAGGED, would we still be able to use
the next iteration even if the hardware does not support "tag of tag"?
I am not sure every vendor will support tag of tag, since there is no information
related to that feature, like in the Google spec document.

We are also looking into this.

Thanks,
Regards.

> > 
> > Thanks,
> > Hyesoo.
> > 
> > > We'll give this a go and hopefully it simplifies the patches a bit (it
> > > will take a while as Alex keeps going on holiday ;)). In the meantime,
> > > I'm talking to the hardware people to see whether we can have MTE pages
> > > in the tag storage/metadata range. We'd still need to reserve about 0.1%
> > > of the RAM for the metadata corresponding to the tag storage range when
> > > used as data but that's negligible (1/32 of 1/32). So if some future
> > > hardware allows this, we can drop the page allocation restriction from
> > > the CMA range.
> > > 
> > > > > though the latter is not guaranteed not to allocate memory from the
> > > > > range, only make it less likely. Both these options are less flexible in
> > > > > terms of size/alignment/placement.
> > > > > 
> > > > > Maybe as a quick hack - only allow PROT_MTE from ZONE_NORMAL and
> > > > > configure the metadata range in ZONE_MOVABLE but at some point I'd
> > > > > expect some CXL-attached memory to support MTE with additional carveout
> > > > > reserved.
> > > > 
> > > > I have no idea how we could possibly cleanly support memory hotplug in
> > > > virtual environments (virtual DIMMs, virtio-mem) with MTE. In contrast to
> > > > s390x storage keys, the approach that arm64 with MTE took here (exposing tag
> > > > memory to the VM) makes it rather hard and complicated.
> > > 
> > > The current thinking is that the VM is not aware of the tag storage,
> > > that's entirely managed by the host. The host would treat the guest
> > > memory similarly to the PROT_MTE user allocations, reserve metadata etc.
> > > 
> > > Thanks for the feedback so far, very useful.
> > > 
> > > -- 
> > > Catalin
> > > 
> 
> 
>
Catalin Marinas Oct. 27, 2023, 11:04 a.m. UTC | #15
On Wed, Oct 25, 2023 at 05:52:58PM +0900, Hyesoo Yu wrote:
> If we only avoid using ALLOC_CMA for __GFP_TAGGED, would we still be able to use
> the next iteration even if the hardware does not support "tag of tag"?

It depends on what the next iteration looks like. The plan was not to
support this so that we avoid another complication where a non-tagged
page is mprotect'ed to become tagged and it would need to be migrated
out of the CMA range. Not sure how much code it would save.

> I am not sure every vendor will support tag of tag, since there is no information
> related to that feature, like in the Google spec document.

If you are aware of any vendors not supporting this, please direct them
to the Arm support team, it would be very useful information for us.

Thanks.

Patch
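
As a quick sanity check of the values in the hunk below: MTE keeps a 4-bit tag
for every 16-byte granule, so tag storage takes 1/32 of the memory it covers
(the same 32:1 ratio mentioned in the thread). The sketch assumes, based on
this device tree only, that each tag storage region sits immediately after the
memory node it describes; the helper names are hypothetical.

```c
#include <stdint.h>

/* 4 tag bits per 16 data bytes (128 bits) gives a 1:32 ratio. */
#define MTE_DATA_TO_TAG_RATIO	32u

/* Size of the tag storage needed to cover mem_size bytes of tagged memory. */
uint64_t tag_storage_size(uint64_t mem_size)
{
	return mem_size / MTE_DATA_TO_TAG_RATIO;
}

/*
 * Assumption taken from this DT only: the tag storage starts right after
 * the end of the memory region it describes.
 */
uint64_t tag_storage_base(uint64_t mem_base, uint64_t mem_size)
{
	return mem_base + mem_size;
}
```

With the values from the hunk, tag_storage_size(0x7c000000) gives 0x3e00000,
and tag_storage_base() gives 0xfc000000 for the first bank and 0x8fc000000 for
the second, matching the reg properties of both metadata nodes.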

diff --git a/arch/arm64/boot/dts/arm/fvp-base-revc.dts b/arch/arm64/boot/dts/arm/fvp-base-revc.dts
index 60472d65a355..bd050373d6cf 100644
--- a/arch/arm64/boot/dts/arm/fvp-base-revc.dts
+++ b/arch/arm64/boot/dts/arm/fvp-base-revc.dts
@@ -165,10 +165,28 @@  C1_L2: l2-cache1 {
                };
        };
 
-       memory@80000000 {
+       memory0: memory@80000000 {
                device_type = "memory";
-               reg = <0x00000000 0x80000000 0 0x80000000>,
-                     <0x00000008 0x80000000 0 0x80000000>;
+               reg = <0x00 0x80000000 0x00 0x7c000000>;
+       };
+
+       metadata0: metadata@fc000000 {
+               compatible = "arm,mte-tag-storage";
+               reg = <0x00 0xfc000000 0x00 0x3e00000>;
+               block-size = <0x1000>;
+               memory = <&memory0>;
+       };
+
+       memory1: memory@880000000 {
+               device_type = "memory";
+               reg = <0x08 0x80000000 0x00 0x7c000000>;
+       };
+
+       metadata1: metadata@8fc000000 {
+               compatible = "arm,mte-tag-storage";
+               reg = <0x08 0xfc000000 0x00 0x3e00000>;
+               block-size = <0x1000>;
+               memory = <&memory1>;
        };
 
        reserved-memory {