Message ID | 20230823131350.114942-1-alexandru.elisei@arm.com (mailing list archive) |
---|---|
State | New, archived |
On 23.08.23 15:13, Alexandru Elisei wrote:
> Introduction
> ============
>
> Arm has implemented memory coloring in hardware, and the feature is called
> Memory Tagging Extension (MTE). It works by embedding a 4 bit tag in bits
> 59..56 of a pointer, and storing this tag to a reserved memory location.
> When the pointer is dereferenced, the hardware compares the tag embedded in
> the pointer (logical tag) with the tag stored in memory (allocation tag).
>
> The relation between memory and where the tag for that memory is stored is
> static.
>
> The memory where the tags are stored has so far been inaccessible to Linux.
> This series aims to change that, by adding support for using the tag storage
> memory only as data memory; tag storage memory cannot be itself tagged.
>
>
> Implementation
> ==============
>
> The series is based on v6.5-rc3 with these two patches cherry picked:
>
> - mm: Call arch_swap_restore() from unuse_pte():
>
> https://lore.kernel.org/all/20230523004312.1807357-3-pcc@google.com/
>
> - arm64: mte: Simplify swap tag restoration logic:
>
> https://lore.kernel.org/all/20230523004312.1807357-4-pcc@google.com/
>
> The above two patches are queued for the v6.6 merge window:
>
> https://lore.kernel.org/all/20230702123821.04e64ea2c04dd0fdc947bda3@linux-foundation.org/
>
> The entire series, including the above patches, can be cloned with:
>
> $ git clone https://gitlab.arm.com/linux-arm/linux-ae.git \
>       -b arm-mte-dynamic-carveout-rfc-v1
>
> On the arm64 architecture side, an extension is being worked on that will
> clarify how MTE tag storage reuse should behave. The extension will be
> made public soon.
>
> On the Linux side, MTE tag storage reuse is accomplished with the
> following changes:
>
> 1. The tag storage memory is exposed to the memory allocator as a new
> migratetype, MIGRATE_METADATA. It behaves similarly to MIGRATE_CMA, with
> the restriction that it cannot be used to allocate tagged memory (tag
> storage memory cannot be tagged).
> On tagged page allocation, the corresponding tag storage is reserved via
> alloc_contig_range().
>
> 2. mprotect(PROT_MTE) is implemented by changing the pte prot to
> PAGE_METADATA_NONE. When the page is next accessed, a fault is taken and
> the corresponding tag storage is reserved.
>
> 3. When the code tries to copy tags to a page which doesn't have the tag
> storage reserved, the tags are copied to an xarray and restored in
> set_pte_at(), when the page is eventually mapped with the tag storage
> reserved.

Hi!

After re-reading it 2 times, I still have no clue what your patch set is
actually trying to achieve. Probably there is a way to describe how user
space intends to interact with this feature, so as to see which value this
actually has for user space -- and if we are using the right APIs and
allocators.

So some dummy questions / statements

1) Is this about repurposing the memory used to hold tags for a different
purpose? Or what exactly is user space going to do with the PROT_MTE memory?
The whole mprotect(PROT_MTE) approach might not be the right thing to do.

2) Why do we even have to involve the page allocator if this is some
special-purpose memory? Repurposing the buddy when later using
alloc_contig_range() either way feels wrong.

[...]
>  arch/arm64/Kconfig                       |  13 +
>  arch/arm64/include/asm/assembler.h       |  10 +
>  arch/arm64/include/asm/memory_metadata.h |  49 ++
>  arch/arm64/include/asm/mte-def.h         |  16 +-
>  arch/arm64/include/asm/mte.h             |  40 +-
>  arch/arm64/include/asm/mte_tag_storage.h |  36 ++
>  arch/arm64/include/asm/page.h            |   5 +-
>  arch/arm64/include/asm/pgtable-prot.h    |   2 +
>  arch/arm64/include/asm/pgtable.h         |  33 +-
>  arch/arm64/kernel/Makefile               |   1 +
>  arch/arm64/kernel/elfcore.c              |  14 +-
>  arch/arm64/kernel/hibernate.c            |  46 +-
>  arch/arm64/kernel/mte.c                  |  31 +-
>  arch/arm64/kernel/mte_tag_storage.c      | 667 +++++++++++++++++++++++
>  arch/arm64/kernel/setup.c                |   7 +
>  arch/arm64/kvm/arm.c                     |   6 +-
>  arch/arm64/lib/mte.S                     |  30 +-
>  arch/arm64/mm/copypage.c                 |  26 +
>  arch/arm64/mm/fault.c                    |  35 +-
>  arch/arm64/mm/mteswap.c                  | 113 +++-
>  fs/proc/meminfo.c                        |   8 +
>  fs/proc/page.c                           |   1 +
>  include/asm-generic/Kbuild               |   1 +
>  include/asm-generic/memory_metadata.h    |  50 ++
>  include/linux/gfp.h                      |  10 +
>  include/linux/gfp_types.h                |  14 +-
>  include/linux/huge_mm.h                  |   6 +
>  include/linux/kernel-page-flags.h        |   1 +
>  include/linux/migrate_mode.h             |   1 +
>  include/linux/mm.h                       |  12 +-
>  include/linux/mmzone.h                   |  26 +-
>  include/linux/page-flags.h               |   1 +
>  include/linux/pgtable.h                  |  19 +
>  include/linux/sched.h                    |   2 +-
>  include/linux/sched/mm.h                 |  13 +
>  include/linux/vm_event_item.h            |   5 +
>  include/linux/vmstat.h                   |   2 +
>  include/trace/events/mmflags.h           |   5 +-
>  mm/Kconfig                               |   5 +
>  mm/compaction.c                          |  52 +-
>  mm/huge_memory.c                         | 109 ++++
>  mm/internal.h                            |   7 +
>  mm/khugepaged.c                          |   7 +
>  mm/memory.c                              | 180 +++++-
>  mm/mempolicy.c                           |   7 +
>  mm/migrate.c                             |   6 +
>  mm/mm_init.c                             |  23 +-
>  mm/mprotect.c                            |  46 ++
>  mm/page_alloc.c                          | 136 ++++-
>  mm/page_isolation.c                      |  19 +-
>  mm/page_owner.c                          |   3 +-
>  mm/shmem.c                               |  14 +-
>  mm/show_mem.c                            |   4 +
>  mm/swapfile.c                            |   4 +
>  mm/vmscan.c                              |   3 +
>  mm/vmstat.c                              |  13 +-
>  56 files changed, 1834 insertions(+), 161 deletions(-)
>  create mode 100644 arch/arm64/include/asm/memory_metadata.h
>  create mode 100644 arch/arm64/include/asm/mte_tag_storage.h
>  create mode 100644 arch/arm64/kernel/mte_tag_storage.c
>  create mode 100644 include/asm-generic/memory_metadata.h

The core-mm changes don't look particularly appealing :)
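As a rough illustration of the logical tag described in the quoted cover letter (a 4 bit tag in pointer bits 59..56, compared against the allocation tag on access), here is a minimal Python model. It is only a sketch: the function names are made up for illustration, and on real systems the comparison is performed by the hardware, not by software like this.

```python
TAG_SHIFT = 56                  # the logical tag lives in pointer bits 59..56
TAG_MASK = 0xF << TAG_SHIFT
U64_MASK = 0xFFFF_FFFF_FFFF_FFFF


def logical_tag(ptr: int) -> int:
    """Return the 4-bit logical tag embedded in a 64-bit pointer value."""
    return (ptr & TAG_MASK) >> TAG_SHIFT


def with_tag(ptr: int, tag: int) -> int:
    """Return ptr with bits 59..56 replaced by the given 4-bit tag."""
    if not 0 <= tag <= 0xF:
        raise ValueError("tag must fit in 4 bits")
    return ((ptr & ~TAG_MASK) & U64_MASK) | (tag << TAG_SHIFT)


def access_allowed(ptr: int, allocation_tag: int) -> bool:
    """Toy model of the check: logical tag must match the allocation tag."""
    return logical_tag(ptr) == allocation_tag
```

For example, `with_tag(addr, 0xA)` produces a pointer whose access is only allowed (in this model) if the allocation tag stored for that memory is also `0xA`.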
On Thu, Aug 24, 2023 at 09:50:32AM +0200, David Hildenbrand wrote:
> after re-reading it 2 times, I still have no clue what your patch set is
> actually trying to achieve. Probably there is a way to describe how user
> space intends to interact with this feature, so to see which value this
> actually has for user space -- and if we are using the right APIs and
> allocators.

I'll try with an alternative summary, hopefully it becomes clearer (I
think Alex is away until the end of the week, may not reply
immediately). If this still doesn't work, maybe we should try a
different implementation ;).

The way MTE is implemented currently is to have a static carve-out of
the DRAM to store the allocation tags (a.k.a. memory colour). This is
what we call the tag storage. Every 16 bytes have 4 bits of tags, so this
means 1/32 of the DRAM, roughly 3%, is used for the tag storage. This is
done transparently by the hardware/interconnect (with firmware setup)
and normally hidden from the OS. So a checked memory access to location
X generates a tag fetch from location Y in the carve-out and this tag is
compared with the bits 59:56 in the pointer. The correspondence from X
to Y is linear (subject to a minimum block size to deal with some
address interleaving). The software doesn't need to know about this
correspondence as we have specific instructions like STG/LDG to location
X that lead to a tag store/load to Y.

Now, not all memory used by applications is tagged (mmap(PROT_MTE)).
For example, some large allocations may not use PROT_MTE at all or only
for the first and last page since initialising the tags takes time. The
side-effect is that of this 3% of DRAM, only part, say 1%, is effectively
used. Some people want the unused tag storage to be released for normal
data usage (i.e. give it to the kernel page allocator).

So the first complication is that a PROT_MTE page allocation at address
X will need to reserve the tag storage at location Y (and migrate any
data in that page if it is in use).

To make things worse, pages in the tag storage/carve-out range cannot
use PROT_MTE themselves on current hardware, so this adds the second
complication - a heterogeneous memory layout. The kernel needs to know
where to allocate a PROT_MTE page from or migrate a current page if it
becomes PROT_MTE (mprotect()) and the range it is in does not support
tagging.

Some other complications are arm64-specific like cache coherency between
tags and data accesses. There is a draft architecture spec which will be
released soon, detailing how the hardware behaves.

To your question about user APIs/ABIs, that's entirely transparent. As
with the current kernel (without this dynamic tag storage), a user only
needs to ask for PROT_MTE mappings to get tagged pages.

> So some dummy questions / statements
>
> 1) Is this about repurposing the memory used to hold tags for different
> purpose?

Yes. To allow part of this 3% to be used for data. It could even be the
whole 3% if no application is enabling MTE.

> Or what exactly is user space going to do with the PROT_MTE memory?
> The whole mprotect(PROT_MTE) approach might not be the right thing to do.

As I mentioned above, there's no difference to the user ABI. PROT_MTE
works as before with the kernel moving pages around as needed.

> 2) Why do we even have to involve the page allocator if this is some
> special-purpose memory? Repurposing the buddy when later using
> alloc_contig_range() either way feels wrong.

The aim here is to rebrand this special-purpose memory as a nearly
general-purpose one (bar the PROT_MTE restriction).

> The core-mm changes don't look particularly appealing :)

OTOH, it's a fun project to learn about the mm ;).

Our aim for now is to get some feedback from the mm community on whether
this special -> nearly general rebranding is acceptable together with
the introduction of a heterogeneous memory concept for the general
purpose page allocator.

There are some alternatives we looked at with a smaller mm impact but we
haven't prototyped them yet: (a) use the available tag storage as a
frontswap accelerator or (b) use it as a (compressed) ramdisk that can
be mounted as swap. The latter has the advantage of showing up in the
available total memory, keeps customers happy ;). Both options would
need some mm hooks when a PROT_MTE page gets allocated to release the
corresponding page in the tag storage range.
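The arithmetic in the summary above (4 bits of tag per 16 bytes of data, hence 1/32 of DRAM, and a linear correspondence from a data location X to a tag location Y) can be sketched as follows. This is a simplified model under stated assumptions: `dram_base` and `tag_base` are hypothetical parameters, the order of the two tags within a byte is an arbitrary choice for the sketch, and the minimum block size used for address interleaving is ignored.

```python
TAG_GRANULE = 16    # bytes of data covered by one allocation tag
TAG_BITS = 4        # bits per allocation tag


def tag_storage_bytes(dram_bytes: int) -> int:
    """Bytes of tag storage needed to cover dram_bytes of data (1/32)."""
    return dram_bytes * TAG_BITS // (TAG_GRANULE * 8)


def tag_location(x: int, dram_base: int, tag_base: int) -> tuple[int, int]:
    """Map data address x to (byte address, bit offset) of its tag.

    Assumes a purely linear layout starting at tag_base (hypothetical)
    and that the tag of the even granule sits in the low nibble.
    """
    granule = (x - dram_base) // TAG_GRANULE
    return tag_base + granule // 2, (granule % 2) * TAG_BITS
```

With these numbers, 32 GiB of DRAM needs 1 GiB of tag storage, and one 4 KiB data page corresponds to 128 bytes of tags.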
On 24.08.23 12:44, Catalin Marinas wrote: > On Thu, Aug 24, 2023 at 09:50:32AM +0200, David Hildenbrand wrote: >> after re-reading it 2 times, I still have no clue what your patch set is >> actually trying to achieve. Probably there is a way to describe how user >> space intents to interact with this feature, so to see which value this >> actually has for user space -- and if we are using the right APIs and >> allocators. > > I'll try with an alternative summary, hopefully it becomes clearer (I > think Alex is away until the end of the week, may not reply > immediately). If this still doesn't work, maybe we should try a > different implementation ;). > > The way MTE is implemented currently is to have a static carve-out of > the DRAM to store the allocation tags (a.k.a. memory colour). This is > what we call the tag storage. Each 16 bytes have 4 bits of tags, so this > means 1/32 of the DRAM, roughly 3% used for the tag storage. This is > done transparently by the hardware/interconnect (with firmware setup) > and normally hidden from the OS. So a checked memory access to location > X generates a tag fetch from location Y in the carve-out and this tag is > compared with the bits 59:56 in the pointer. The correspondence from X > to Y is linear (subject to a minimum block size to deal with some > address interleaving). The software doesn't need to know about this > correspondence as we have specific instructions like STG/LDG to location > X that lead to a tag store/load to Y. > > Now, not all memory used by applications is tagged (mmap(PROT_MTE)). > For example, some large allocations may not use PROT_MTE at all or only > for the first and last page since initialising the tags takes time. The > side-effect is that of these 3% DRAM, only part, say 1% is effectively > used. Some people want the unused tag storage to be released for normal > data usage (i.e. give it to the kernel page allocator). 
> > So the first complication is that a PROT_MTE page allocation at address > X will need to reserve the tag storage at location Y (and migrate any > data in that page if it is in use). > > To make things worse, pages in the tag storage/carve-out range cannot > use PROT_MTE themselves on current hardware, so this adds the second > complication - a heterogeneous memory layout. The kernel needs to know > where to allocate a PROT_MTE page from or migrate a current page if it > becomes PROT_MTE (mprotect()) and the range it is in does not support > tagging. > > Some other complications are arm64-specific like cache coherency between > tags and data accesses. There is a draft architecture spec which will be > released soon, detailing how the hardware behaves. > > To your question about user APIs/ABIs, that's entirely transparent. As > with the current kernel (without this dynamic tag storage), a user only > needs to ask for PROT_MTE mappings to get tagged pages. Thanks, that clarifies things a lot. So it sounds like you might want to provide that tag memory using CMA. That way, only movable allocations can end up on that CMA memory area, and you can allocate selected tag pages on demand (similar to the alloc_contig_range() use case). That also solves the issue that such tag memory must not be longterm-pinned. Regarding one complication: "The kernel needs to know where to allocate a PROT_MTE page from or migrate a current page if it becomes PROT_MTE (mprotect()) and the range it is in does not support tagging.", simplified handling would be if it's in a MIGRATE_CMA pageblock, it doesn't support tagging. You have to migrate to a !CMA page (for example, not specifying GFP_MOVABLE as a quick way to achieve that). (I have no idea how tag/tagged memory interacts with memory hotplug, I assume it just doesn't work) > >> So some dummy questions / statements >> >> 1) Is this about re-propusing the memory used to hold tags for different >> purpose? > > Yes. 
> To allow part of this 3% to be used for data. It could even be the
> whole 3% if no application is enabling MTE.
>
>> Or what exactly is user space going to do with the PROT_MTE memory?
>> The whole mprotect(PROT_MTE) approach might not be the right thing to do.
>
> As I mentioned above, there's no difference to the user ABI. PROT_MTE
> works as before with the kernel moving pages around as needed.
>
>> 2) Why do we even have to involve the page allocator if this is some
>> special-purpose memory? Repurposing the buddy when later using
>> alloc_contig_range() either way feels wrong.
>
> The aim here is to rebrand this special-purpose memory as a nearly
> general-purpose one (bar the PROT_MTE restriction).
>
>> The core-mm changes don't look particularly appealing :)
>
> OTOH, it's a fun project to learn about the mm ;).
>
> Our aim for now is to get some feedback from the mm community on whether
> this special -> nearly general rebranding is acceptable together with
> the introduction of a heterogeneous memory concept for the general
> purpose page allocator.
>
> There are some alternatives we looked at with a smaller mm impact but we
> haven't prototyped them yet: (a) use the available tag storage as a
> frontswap accelerator or (b) use it as a (compressed) ramdisk that can

Frontswap is no more :)

> be mounted as swap. The latter has the advantage of showing up in the
> available total memory, keeps customers happy ;). Both options would
> need some mm hooks when a PROT_MTE page gets allocated to release the
> corresponding page in the tag storage range.

Yes, some way of MM integration would be required. If CMA could get the
job done, you might get most of what you need already.
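The simplified handling suggested in this reply (a page in a MIGRATE_CMA pageblock is treated as not supporting tagging, so it has to be migrated to a !CMA page before it can become PROT_MTE) can be modelled roughly like this. It is a toy sketch, not kernel code: the `MigrateType` enum, the per-pageblock list, and the pageblock size of 512 pages are all assumptions made for illustration.

```python
from enum import Enum


class MigrateType(Enum):
    UNMOVABLE = 0
    MOVABLE = 1
    CMA = 2


PAGEBLOCK_PAGES = 512   # assumption: 2 MiB pageblocks with 4 KiB pages


def pageblock_type(pfn: int, pageblock_types: list) -> MigrateType:
    """Look up the migratetype of the pageblock containing pfn."""
    return pageblock_types[pfn // PAGEBLOCK_PAGES]


def must_migrate_for_mte(pfn: int, pageblock_types: list) -> bool:
    """Under the simplified rule, a page that is about to become PROT_MTE
    has to move if and only if it sits in a CMA pageblock."""
    return pageblock_type(pfn, pageblock_types) is MigrateType.CMA
```

Note that this rule treats every CMA pageblock as untaggable, including CMA ranges that hold ordinary memory.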
On 24.08.23 13:06, David Hildenbrand wrote: > On 24.08.23 12:44, Catalin Marinas wrote: >> On Thu, Aug 24, 2023 at 09:50:32AM +0200, David Hildenbrand wrote: >>> after re-reading it 2 times, I still have no clue what your patch set is >>> actually trying to achieve. Probably there is a way to describe how user >>> space intents to interact with this feature, so to see which value this >>> actually has for user space -- and if we are using the right APIs and >>> allocators. >> >> I'll try with an alternative summary, hopefully it becomes clearer (I >> think Alex is away until the end of the week, may not reply >> immediately). If this still doesn't work, maybe we should try a >> different implementation ;). >> >> The way MTE is implemented currently is to have a static carve-out of >> the DRAM to store the allocation tags (a.k.a. memory colour). This is >> what we call the tag storage. Each 16 bytes have 4 bits of tags, so this >> means 1/32 of the DRAM, roughly 3% used for the tag storage. This is >> done transparently by the hardware/interconnect (with firmware setup) >> and normally hidden from the OS. So a checked memory access to location >> X generates a tag fetch from location Y in the carve-out and this tag is >> compared with the bits 59:56 in the pointer. The correspondence from X >> to Y is linear (subject to a minimum block size to deal with some >> address interleaving). The software doesn't need to know about this >> correspondence as we have specific instructions like STG/LDG to location >> X that lead to a tag store/load to Y. >> >> Now, not all memory used by applications is tagged (mmap(PROT_MTE)). >> For example, some large allocations may not use PROT_MTE at all or only >> for the first and last page since initialising the tags takes time. The >> side-effect is that of these 3% DRAM, only part, say 1% is effectively >> used. Some people want the unused tag storage to be released for normal >> data usage (i.e. 
give it to the kernel page allocator). >> >> So the first complication is that a PROT_MTE page allocation at address >> X will need to reserve the tag storage at location Y (and migrate any >> data in that page if it is in use). >> >> To make things worse, pages in the tag storage/carve-out range cannot >> use PROT_MTE themselves on current hardware, so this adds the second >> complication - a heterogeneous memory layout. The kernel needs to know >> where to allocate a PROT_MTE page from or migrate a current page if it >> becomes PROT_MTE (mprotect()) and the range it is in does not support >> tagging. >> >> Some other complications are arm64-specific like cache coherency between >> tags and data accesses. There is a draft architecture spec which will be >> released soon, detailing how the hardware behaves. >> >> To your question about user APIs/ABIs, that's entirely transparent. As >> with the current kernel (without this dynamic tag storage), a user only >> needs to ask for PROT_MTE mappings to get tagged pages. > > Thanks, that clarifies things a lot. > > So it sounds like you might want to provide that tag memory using CMA. > > That way, only movable allocations can end up on that CMA memory area, > and you can allocate selected tag pages on demand (similar to the > alloc_contig_range() use case). > > That also solves the issue that such tag memory must not be longterm-pinned. > > Regarding one complication: "The kernel needs to know where to allocate > a PROT_MTE page from or migrate a current page if it becomes PROT_MTE > (mprotect()) and the range it is in does not support tagging.", > simplified handling would be if it's in a MIGRATE_CMA pageblock, it > doesn't support tagging. You have to migrate to a !CMA page (for > example, not specifying GFP_MOVABLE as a quick way to achieve that). > Okay, I now realize that this patch set effectively duplicates some CMA behavior using a new migrate-type. 
Yeah, that's probably not what we want just to identify if memory is taggable or not. Maybe there is a way to just keep reusing most of CMA instead. Another simpler idea to get started would be to just intercept the first PROT_MTE, and allocate all CMA memory. In that case, systems that don't ever use PROT_MTE can have that additional 3% of memory. You probably know better how frequent it is that only a handful of applications use PROT_MTE, such that there is still a significant portion of tag memory to be reused (and if it's really worth optimizing for that scenario).
On Thu, Aug 24, 2023 at 01:25:41PM +0200, David Hildenbrand wrote: > On 24.08.23 13:06, David Hildenbrand wrote: > > On 24.08.23 12:44, Catalin Marinas wrote: > > > The way MTE is implemented currently is to have a static carve-out of > > > the DRAM to store the allocation tags (a.k.a. memory colour). This is > > > what we call the tag storage. Each 16 bytes have 4 bits of tags, so this > > > means 1/32 of the DRAM, roughly 3% used for the tag storage. This is > > > done transparently by the hardware/interconnect (with firmware setup) > > > and normally hidden from the OS. So a checked memory access to location > > > X generates a tag fetch from location Y in the carve-out and this tag is > > > compared with the bits 59:56 in the pointer. The correspondence from X > > > to Y is linear (subject to a minimum block size to deal with some > > > address interleaving). The software doesn't need to know about this > > > correspondence as we have specific instructions like STG/LDG to location > > > X that lead to a tag store/load to Y. > > > > > > Now, not all memory used by applications is tagged (mmap(PROT_MTE)). > > > For example, some large allocations may not use PROT_MTE at all or only > > > for the first and last page since initialising the tags takes time. The > > > side-effect is that of these 3% DRAM, only part, say 1% is effectively > > > used. Some people want the unused tag storage to be released for normal > > > data usage (i.e. give it to the kernel page allocator). [...] > > So it sounds like you might want to provide that tag memory using CMA. > > > > That way, only movable allocations can end up on that CMA memory area, > > and you can allocate selected tag pages on demand (similar to the > > alloc_contig_range() use case). > > > > That also solves the issue that such tag memory must not be longterm-pinned. 
> > > > Regarding one complication: "The kernel needs to know where to allocate > > a PROT_MTE page from or migrate a current page if it becomes PROT_MTE > > (mprotect()) and the range it is in does not support tagging.", > > simplified handling would be if it's in a MIGRATE_CMA pageblock, it > > doesn't support tagging. You have to migrate to a !CMA page (for > > example, not specifying GFP_MOVABLE as a quick way to achieve that). > > Okay, I now realize that this patch set effectively duplicates some CMA > behavior using a new migrate-type. Yes, pretty much, with some additional hooks to trigger migration. The CMA mechanism was a great source of inspiration. In addition, there are some races that are addressed mostly around page migration/copying: the source page is untagged, the destination allocated as untagged but before the copy an mprotect() makes the source tagged (PG_mte_tagged set) and the copy_highpage() mechanism not having anywhere to store the tags. > Yeah, that's probably not what we want just to identify if memory is > taggable or not. > > Maybe there is a way to just keep reusing most of CMA instead. A potential issue is that devices (mobile phones) may need a different CMA range as well for DMA (and not necessarily in ZONE_DMA). Can free_area[MIGRATE_CMA] handle multiple disjoint ranges? I don't see why not as it's just a list. We (Google and Arm) went through a few rounds of discussions and prototyping trying to find the best approach: (1) a separate free_area[] array in each zone (early proof of concept from Peter C and Evgenii S, https://github.com/google/sanitizers/tree/master/mte-dynamic-carveout), (2) a new ZONE_METADATA, (3) a separate CPU-less NUMA node just for the tag storage, (4) a new MIGRATE_METADATA type. We settled on the latter as it closely resembles CMA without interfering with it. 
I don't remember why we did not just go for MIGRATE_CMA, it may have been the heterogeneous memory aspect and the fact that we don't want PROT_MTE (VM_MTE) allocations from this range. If the hardware allowed this, I think the patches would have been a bit simpler. Alex can comment more next week on how we ended up with this choice but if we find a way to avoid VM_MTE allocations from certain areas, I think we can reuse the CMA infrastructure. A bigger hammer would be no VM_MTE allocations from any CMA range but it seems too restrictive. > Another simpler idea to get started would be to just intercept the first > PROT_MTE, and allocate all CMA memory. In that case, systems that don't ever > use PROT_MTE can have that additional 3% of memory. We had this on the table as well but the most likely deployment, at least initially, is only some secure services enabling MTE with various apps gradually moving towards this in time. So that's why the main pushback from vendors is having this 3% reserved permanently. Even if all apps use MTE, only the anonymous mappings are PROT_MTE, so still not fully using the tag storage.
Hi, Thank you for the feedback! Catalin did a great job explaining what this patch series does, I'll add my own comments on top of his. On Thu, Aug 24, 2023 at 04:24:30PM +0100, Catalin Marinas wrote: > On Thu, Aug 24, 2023 at 01:25:41PM +0200, David Hildenbrand wrote: > > On 24.08.23 13:06, David Hildenbrand wrote: > > > On 24.08.23 12:44, Catalin Marinas wrote: > > > > The way MTE is implemented currently is to have a static carve-out of > > > > the DRAM to store the allocation tags (a.k.a. memory colour). This is > > > > what we call the tag storage. Each 16 bytes have 4 bits of tags, so this > > > > means 1/32 of the DRAM, roughly 3% used for the tag storage. This is > > > > done transparently by the hardware/interconnect (with firmware setup) > > > > and normally hidden from the OS. So a checked memory access to location > > > > X generates a tag fetch from location Y in the carve-out and this tag is > > > > compared with the bits 59:56 in the pointer. The correspondence from X > > > > to Y is linear (subject to a minimum block size to deal with some > > > > address interleaving). The software doesn't need to know about this > > > > correspondence as we have specific instructions like STG/LDG to location > > > > X that lead to a tag store/load to Y. > > > > > > > > Now, not all memory used by applications is tagged (mmap(PROT_MTE)). > > > > For example, some large allocations may not use PROT_MTE at all or only > > > > for the first and last page since initialising the tags takes time. The > > > > side-effect is that of these 3% DRAM, only part, say 1% is effectively > > > > used. Some people want the unused tag storage to be released for normal > > > > data usage (i.e. give it to the kernel page allocator). > [...] > > > So it sounds like you might want to provide that tag memory using CMA. 
> > >
> > > That way, only movable allocations can end up on that CMA memory area,
> > > and you can allocate selected tag pages on demand (similar to the
> > > alloc_contig_range() use case).
> > >
> > > That also solves the issue that such tag memory must not be longterm-pinned.
> > >
> > > Regarding one complication: "The kernel needs to know where to allocate
> > > a PROT_MTE page from or migrate a current page if it becomes PROT_MTE
> > > (mprotect()) and the range it is in does not support tagging.",
> > > simplified handling would be if it's in a MIGRATE_CMA pageblock, it
> > > doesn't support tagging. You have to migrate to a !CMA page (for
> > > example, not specifying GFP_MOVABLE as a quick way to achieve that).
> >
> > Okay, I now realize that this patch set effectively duplicates some CMA
> > behavior using a new migrate-type.
>
> Yes, pretty much, with some additional hooks to trigger migration. The
> CMA mechanism was a great source of inspiration.
>
> In addition, there are some races that are addressed mostly around page
> migration/copying: the source page is untagged, the destination
> allocated as untagged but before the copy an mprotect() makes the source
> tagged (PG_mte_tagged set) and the copy_highpage() mechanism not having
> anywhere to store the tags.
>
> > Yeah, that's probably not what we want just to identify if memory is
> > taggable or not.
> >
> > Maybe there is a way to just keep reusing most of CMA instead.
>
> A potential issue is that devices (mobile phones) may need a different
> CMA range as well for DMA (and not necessarily in ZONE_DMA). Can
> free_area[MIGRATE_CMA] handle multiple disjoint ranges? I don't see why
> not as it's just a list.

I don't think that's a problem either, today the user can specify multiple
CMA ranges on the kernel command line (via "cma", "hugetlb_cma", etc). CMA
already has the mechanism to keep track of multiple regions - it stores them
in the cma_areas array.
> > We (Google and Arm) went through a few rounds of discussions and > prototyping trying to find the best approach: (1) a separate free_area[] > array in each zone (early proof of concept from Peter C and Evgenii S, > https://github.com/google/sanitizers/tree/master/mte-dynamic-carveout), > (2) a new ZONE_METADATA, (3) a separate CPU-less NUMA node just for the > tag storage, (4) a new MIGRATE_METADATA type. > > We settled on the latter as it closely resembles CMA without interfering > with it. I don't remember why we did not just go for MIGRATE_CMA, it may > have been the heterogeneous memory aspect and the fact that we don't > want PROT_MTE (VM_MTE) allocations from this range. If the hardware > allowed this, I think the patches would have been a bit simpler. You are correct, we settled on a new migrate type because the tag storage memory is fundamentally a different memory type with different properties than the rest of the memory in the system: tag storage memory cannot be tagged, MIGRATE_CMA memory can be tagged. > > Alex can comment more next week on how we ended up with this choice but > if we find a way to avoid VM_MTE allocations from certain areas, I think > we can reuse the CMA infrastructure. A bigger hammer would be no VM_MTE > allocations from any CMA range but it seems too restrictive. I considered mixing the tag storage memory memory with normal memory and adding it to MIGRATE_CMA. But since tag storage memory cannot be tagged, this means that it's not enough anymore to have a __GFP_MOVABLE allocation request to use MIGRATE_CMA. I considered two solutions to this problem: 1. Only allocate from MIGRATE_CMA is the requested memory is not tagged => this effectively means transforming all memory from MIGRATE_CMA into the MIGRATE_METADATA migratetype that the series introduces. Not very appealing, because that means treating normal memory that is also on the MIGRATE_CMA lists as tagged memory. 2. 
Keep track of which pages are tag storage at page granularity (either by a page flag, or by checking that the pfn falls in one of the tag storage region, or by some other mechanism). When the page allocator takes free pages from the MIGRATE_METADATA list to satisfy an allocation, compare the gfp mask with the page type, and if the allocation is tagged and the page is a tag storage page, put it back at the tail of the free list and choose the next page. Repeat until the page allocator finds a normal memory page that can be tagged (some refinements obviously needed to need to avoid infinite loops). I considered solution 2 to be more complicated than keeping track of tag storage page at the migratetype level. Conceptually, keeping two distinct memory type on separate migrate types looked to me like the cleaner and simpler solution. Maybe I missed something, I'm definitely open to suggestions regarding putting the tag storage pages on MIGRATE_CMA (or another migratetype) if that's a better approach. Might be worth pointing out that putting the tag storage memory on the MIGRATE_CMA migratetype only changes how the page allocator allocates pages; all the other changes to migration/compaction/mprotect/etc will still be there, because they are needed not because of how the tag storage memory is represented by the page allocator, but because tag storage memory cannot be tagged, and regular memory can. Thanks, Alex > > > Another simpler idea to get started would be to just intercept the first > > PROT_MTE, and allocate all CMA memory. In that case, systems that don't ever > > use PROT_MTE can have that additional 3% of memory. > > We had this on the table as well but the most likely deployment, at > least initially, is only some secure services enabling MTE with various > apps gradually moving towards this in time. So that's why the main > pushback from vendors is having this 3% reserved permanently. 
Even if > all apps use MTE, only the anonymous mappings are PROT_MTE, so still not > fully using the tag storage. > > -- > Catalin >
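As a rough illustration of solution 2 above, here is a toy model in Python (not kernel code) of rotating unusable pages to the tail of a free list. The names take_page, free_list and tag_storage_pfns are made up for this sketch, and bounding the walk to one pass over the list is just one possible way to avoid the infinite loop mentioned above.

```python
from collections import deque


def take_page(free_list, tag_storage_pfns, tagged):
    """Pop the first usable pfn from the head of free_list.

    A tag storage page cannot back a tagged allocation, so it is put back
    at the tail and the next page is tried. The walk is bounded to one
    pass over the list; None is returned if nothing usable was found.
    """
    for _ in range(len(free_list)):
        pfn = free_list.popleft()
        if tagged and pfn in tag_storage_pfns:
            free_list.append(pfn)  # rotate to the tail, try the next page
            continue
        return pfn
    return None


# Hypothetical free list: pfns 10 and 11 are tag storage, 20 is normal memory.
free_list = deque([10, 11, 20])
tag_storage = {10, 11}
page = take_page(free_list, tag_storage, tagged=True)
```

In this model a tagged request may have to walk the entire list before it succeeds or gives up, so the cost grows with the number of tag storage pages sitting on the list.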
On Wed, Sep 06, 2023 at 12:23:21PM +0100, Alexandru Elisei wrote: > On Thu, Aug 24, 2023 at 04:24:30PM +0100, Catalin Marinas wrote: > > On Thu, Aug 24, 2023 at 01:25:41PM +0200, David Hildenbrand wrote: > > > On 24.08.23 13:06, David Hildenbrand wrote: > > > > Regarding one complication: "The kernel needs to know where to allocate > > > > a PROT_MTE page from or migrate a current page if it becomes PROT_MTE > > > > (mprotect()) and the range it is in does not support tagging.", > > > > simplified handling would be if it's in a MIGRATE_CMA pageblock, it > > > > doesn't support tagging. You have to migrate to a !CMA page (for > > > > example, not specifying GFP_MOVABLE as a quick way to achieve that). > > > > > > Okay, I now realize that this patch set effectively duplicates some CMA > > > behavior using a new migrate-type. [...] > I considered mixing the tag storage memory memory with normal memory and > adding it to MIGRATE_CMA. But since tag storage memory cannot be tagged, > this means that it's not enough anymore to have a __GFP_MOVABLE allocation > request to use MIGRATE_CMA. > > I considered two solutions to this problem: > > 1. Only allocate from MIGRATE_CMA is the requested memory is not tagged => > this effectively means transforming all memory from MIGRATE_CMA into the > MIGRATE_METADATA migratetype that the series introduces. Not very > appealing, because that means treating normal memory that is also on the > MIGRATE_CMA lists as tagged memory. That's indeed not ideal. We could try this if it makes the patches significantly simpler, though I'm not so sure. Allocating metadata is the easier part as we know the correspondence from the tagged pages (32 PROT_MTE pages) to the metadata page (1 tag storage page), so alloc_contig_range() does this for us. Just adding it to the CMA range is sufficient. However, making sure that we don't allocate PROT_MTE pages from the metadata range is what led us to another migrate type. 
I guess we could achieve something similar with a new zone or a CPU-less NUMA node, though the latter is not guaranteed not to allocate memory from the range, only make it less likely. Both these options are less flexible in terms of size/alignment/placement. Maybe as a quick hack - only allow PROT_MTE from ZONE_NORMAL and configure the metadata range in ZONE_MOVABLE but at some point I'd expect some CXL-attached memory to support MTE with additional carveout reserved. To recap, in this series, a PROT_MTE page allocation starts with a typical allocation from anywhere other than MIGRATE_METADATA followed by the hooks to reserve the corresponding metadata range at (pfn * 128 + offset) for a 4K page. The whole metadata page is reserved, so the adjacent 31 pages around the original allocation can also be mapped as PROT_MTE. (Peter and Evgenii @ Google had a slightly different approach in their prototype: separate free_area[] array for PROT_MTE pages; while it has some advantages, I found it more intrusive since the same page can be on a free_area/free_list or another) > 2. Keep track of which pages are tag storage at page granularity (either by > a page flag, or by checking that the pfn falls in one of the tag storage > region, or by some other mechanism). When the page allocator takes free > pages from the MIGRATE_METADATA list to satisfy an allocation, compare the > gfp mask with the page type, and if the allocation is tagged and the page > is a tag storage page, put it back at the tail of the free list and choose > the next page. Repeat until the page allocator finds a normal memory page > that can be tagged (some refinements obviously needed to need to avoid > infinite loops). With large enough CMA areas, there's a real risk of latency spikes, RCU stalls etc. Not really keen on such heuristics.
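To make the 32:1 correspondence and the "pfn * 128 + offset" arithmetic above concrete, here is a minimal sketch in Python (not kernel code). It assumes 4K pages and one 4-bit tag per 16 bytes of data; DATA_BASE_PFN and TAG_BASE_PFN describe a hypothetical layout and, like the function names, do not come from the series.

```python
PAGE_SIZE = 4096
GRANULE = 16   # bytes of data covered by one MTE tag
TAG_BITS = 4   # size of one tag

# 4096 / 16 = 256 tags per page, 4 bits each -> 128 bytes of tag storage.
TAG_BYTES_PER_PAGE = PAGE_SIZE // GRANULE * TAG_BITS // 8
# One 4K tag storage page therefore holds the tags for 32 data pages.
DATA_PAGES_PER_TAG_PAGE = PAGE_SIZE // TAG_BYTES_PER_PAGE

# Hypothetical layout, for illustration only.
DATA_BASE_PFN = 0x80000
TAG_BASE_PFN = 0xFC000


def tag_storage_for(pfn):
    """Return (tag storage pfn, byte offset inside it) for a data pfn."""
    byte_off = (pfn - DATA_BASE_PFN) * TAG_BYTES_PER_PAGE
    return TAG_BASE_PFN + byte_off // PAGE_SIZE, byte_off % PAGE_SIZE


def data_pfns_sharing(tag_pfn):
    """Return the data pfns whose tags live in the given tag storage page."""
    first = DATA_BASE_PFN + (tag_pfn - TAG_BASE_PFN) * DATA_PAGES_PER_TAG_PAGE
    return range(first, first + DATA_PAGES_PER_TAG_PAGE)
```

Because a whole tag storage page is reserved at once, every pfn returned by data_pfns_sharing() for that page can be mapped as PROT_MTE without a further reservation.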
On 11.09.23 13:52, Catalin Marinas wrote: > On Wed, Sep 06, 2023 at 12:23:21PM +0100, Alexandru Elisei wrote: >> On Thu, Aug 24, 2023 at 04:24:30PM +0100, Catalin Marinas wrote: >>> On Thu, Aug 24, 2023 at 01:25:41PM +0200, David Hildenbrand wrote: >>>> On 24.08.23 13:06, David Hildenbrand wrote: >>>>> Regarding one complication: "The kernel needs to know where to allocate >>>>> a PROT_MTE page from or migrate a current page if it becomes PROT_MTE >>>>> (mprotect()) and the range it is in does not support tagging.", >>>>> simplified handling would be if it's in a MIGRATE_CMA pageblock, it >>>>> doesn't support tagging. You have to migrate to a !CMA page (for >>>>> example, not specifying GFP_MOVABLE as a quick way to achieve that). >>>> >>>> Okay, I now realize that this patch set effectively duplicates some CMA >>>> behavior using a new migrate-type. > [...] >> I considered mixing the tag storage memory memory with normal memory and >> adding it to MIGRATE_CMA. But since tag storage memory cannot be tagged, >> this means that it's not enough anymore to have a __GFP_MOVABLE allocation >> request to use MIGRATE_CMA. >> >> I considered two solutions to this problem: >> >> 1. Only allocate from MIGRATE_CMA is the requested memory is not tagged => >> this effectively means transforming all memory from MIGRATE_CMA into the >> MIGRATE_METADATA migratetype that the series introduces. Not very >> appealing, because that means treating normal memory that is also on the >> MIGRATE_CMA lists as tagged memory. > > That's indeed not ideal. We could try this if it makes the patches > significantly simpler, though I'm not so sure. > > Allocating metadata is the easier part as we know the correspondence > from the tagged pages (32 PROT_MTE page) to the metadata page (1 tag > storage page), so alloc_contig_range() does this for us. Just adding it > to the CMA range is sufficient. 
> > However, making sure that we don't allocate PROT_MTE pages from the > metadata range is what led us to another migrate type. I guess we could > achieve something similar with a new zone or a CPU-less NUMA node, Ideally, no significant core-mm changes to optimize for an architecture oddity. That implies no new zones and no new migratetypes -- unless it is unavoidable and you are confident that you can convince core-MM people that the use case (giving back 3% of system RAM at max in some setups) is worth the trouble. I also had CPU-less NUMA nodes in mind when thinking about that, but not sure how easy it would be to integrate it. If the tag memory actually has different performance characteristics as well, a NUMA node would be the right choice. If we could find some way to easily support this either via CMA or CPU-less NUMA nodes, that would be much preferable; even if we cannot cover each and every future use case right now. I expect some issues with CXL+MTE either way, but am happy to be taught otherwise :) Another thought I had was adding something like CMA memory characteristics. Like, asking if a given CMA area/page supports tagging (i.e., flag for the CMA area set?)? When you need memory that supports tagging and have a page that does not support tagging (CMA && taggable), simply migrate to !MOVABLE memory (eventually we could also try adding !CMA). Was that discussed and what would be the challenges with that? Page migration due to compaction comes to mind, but it might also be easy to handle if we can just avoid CMA memory for that. > though the latter is not guaranteed not to allocate memory from the > range, only make it less likely. Both these options are less flexible in > terms of size/alignment/placement. > > Maybe as a quick hack - only allow PROT_MTE from ZONE_NORMAL and > configure the metadata range in ZONE_MOVABLE but at some point I'd > expect some CXL-attached memory to support MTE with additional carveout > reserved. 
I have no idea how we could possibly cleanly support memory hotplug in virtual environments (virtual DIMMs, virtio-mem) with MTE. In contrast to s390x storage keys, the approach that arm64 with MTE took here (exposing tag memory to the VM) makes it rather hard and complicated.
On Wed, 2023-08-23 at 14:13 +0100, Alexandru Elisei wrote: > Introduction > ============ > > Arm has implemented memory coloring in hardware, and the feature is > called > Memory Tagging Extensions (MTE). It works by embedding a 4 bit tag in > bits > 59..56 of a pointer, and storing this tag to a reserved memory > location. > When the pointer is dereferenced, the hardware compares the tag > embedded in > the pointer (logical tag) with the tag stored in memory (allocation > tag). > > The relation between memory and where the tag for that memory is > stored is > static. > > The memory where the tags are stored have been so far unaccessible to > Linux. > This series aims to change that, by adding support for using the tag > storage > memory only as data memory; tag storage memory cannot be itself > tagged. > > > Implementation > ============== > > The series is based on v6.5-rc3 with these two patches cherry picked: > > - mm: Call arch_swap_restore() from unuse_pte(): > > > https://lore.kernel.org/all/20230523004312.1807357-3-pcc@google.com/ > > - arm64: mte: Simplify swap tag restoration logic: > > > https://lore.kernel.org/all/20230523004312.1807357-4-pcc@google.com/ > > The above two patches are queued for the v6.6 merge window: > > > https://lore.kernel.org/all/20230702123821.04e64ea2c04dd0fdc947bda3@linux-foundation.org/ > > The entire series, including the above patches, can be cloned with: > > $ git clone https://gitlab.arm.com/linux-arm/linux-ae.git \ > -b arm-mte-dynamic-carveout-rfc-v1 > > On the arm64 architecture side, an extension is being worked on that > will > clarify how MTE tag storage reuse should behave. The extension will > be > made public soon. > > On the Linux side, MTE tag storage reuse is accomplished with the > following changes: > > 1. The tag storage memory is exposed to the memory allocator as a new > migratetype, MIGRATE_METADATA. 
It behaves similarly to MIGRATE_CMA, > with > the restriction that it cannot be used to allocate tagged memory (tag > storage memory cannot be tagged). On tagged page allocation, the > corresponding tag storage is reserved via alloc_contig_range(). > > 2. mprotect(PROT_MTE) is implemented by changing the pte prot to > PAGE_METADATA_NONE. When the page is next accessed, a fault is taken > and > the corresponding tag storage is reserved. > > 3. When the code tries to copy tags to a page which doesn't have the > tag > storage reserved, the tags are copied to an xarray and restored in > set_pte_at(), when the page is eventually mapped with the tag storage > reserved. > > KVM support has not been implemented yet, that because a non-MTE > enabled VMA > can back the memory of an MTE-enabled VM. After there is a consensus > on the > right approach on the memory management support, I will add it. > > Explanations for the last two changes follow. The gist of it is that > they > were added mostly because of races, and it my intention to make the > code > more robust. > > PAGE_METADATA_NONE was introduced to avoid races with > mprotect(PROT_MTE). > For example, migration can race with mprotect(PROT_MTE): > - thread 0 initiates migration for a page in a non-MTE enabled VMA > and a > destination page is allocated without tag storage. > - thread 1 handles an mprotect(PROT_MTE), the VMA becomes tagged, and > an > access turns the source page that is in the process of being > migrated > into a tagged page. > - thread 0 finishes migration and the destination page is mapped as > tagged, > but without tag storage reserved. > More details and examples can be found in the patches. > > This race is also related to how tag restoring is handled when tag > storage > is missing: when a tagged page is swapped out, the tags are saved in > an > xarray indexed by swp_entry.val. 
When a page is swapped back in, if > there > are tags corresponding to the swp_entry that the page will replace, > the > tags are unconditionally restored, even if the page will be mapped as > untagged. Because the page will be mapped as untagged, tag storage > was > not reserved when the page was allocated to replace the swp_entry > which has > tags associated with it. > > To get around this, save the tags in a new xarray, this time indexed > by > pfn, and restore them when the same page is mapped as tagged. > > This also solves another race, this time with copy_highpage. In the > scenario where migration races with mprotect(PROT_MTE), before the > page is > mapped, the contents of the source page is copied to the destination. > And > this includes tags, which will be copied to a page with missing tag > storage, which can to data corruption if the missing tag storage is > in use > for data. So copy_highpage() has received a similar treatment to the > swap > code, and the source tags are copied in the xarray indexed by the > destination page pfn. > > > Overview of the patches > ======================= > > Patches 1-3 do some preparatory work by renaming a few functions and > a gfp > flag. > > Patches 4-12 are arch independent and introduce MIGRATE_METADATA to > the > page allocator. > > Patches 13-18 are arm64 specific and add support for detecting the > tag > storage region and onlining it with the MIGRATE_METADATA migratetype. > > Patches 19-24 are arch independent and modify the page allocator to > callback into arch dependant functions to reserve metadata storage > for an > allocation which requires metadata. > > Patches 25-28 are mostly arm64 specific and implement the reservation > and > freeing of tag storage on tagged page allocation. Patch #28 ("mm: > sched: > Introduce PF_MEMALLOC_ISOLATE") adds a current flag, > PF_MEMALLOC_ISOLATE, > which ignores page isolation limits; this is used by arm64 when > reserving > tag storage in the same patch. 
> > Patches 29-30 add arch independent support for doing > mprotect(PROT_MTE) > when metadata storage is enabled. > > Patches 31-37 are mostly arm64 specific and handle the restoring of > tags > when tag storage is missing. The exceptions are patches 32 (adds the > arch_swap_prepare_to_restore() function) and 35 (add > PAGE_METADATA_NONE > support for THPs). > > Testing > ======= > > To enable MTE dynamic tag storage: > > - CONFIG_ARM64_MTE_TAG_STORAGE=y > - system_supports_mte() returns true > - kasan_hw_tags_enabled() returns false > - correct DTB node (for the specification, see commit "arm64: mte: > Reserve tag > storage memory") > > Check dmesg for the message "MTE tag storage enabled" or grep for > metadata > in /proc/vmstat. > > I've tested the series using FVP with MTE enabled, but without > support for > dynamic tag storage reuse. To simulate it, I've added two fake tag > storage > regions in the DTB by splitting a 2GB region roughly into 33 slices > of size > 0x3e0_0000, and using 32 of them for tagged memory and one slice for > tag > storage: > > diff --git a/arch/arm64/boot/dts/arm/fvp-base-revc.dts > b/arch/arm64/boot/dts/arm/fvp-base-revc.dts > index 60472d65a355..bd050373d6cf 100644 > --- a/arch/arm64/boot/dts/arm/fvp-base-revc.dts > +++ b/arch/arm64/boot/dts/arm/fvp-base-revc.dts > @@ -165,10 +165,28 @@ C1_L2: l2-cache1 { > }; > }; > > - memory@80000000 { > + memory0: memory@80000000 { > device_type = "memory"; > - reg = <0x00000000 0x80000000 0 0x80000000>, > - <0x00000008 0x80000000 0 0x80000000>; > + reg = <0x00 0x80000000 0x00 0x7c000000>; > + }; > + > + metadata0: metadata@c0000000 { > + compatible = "arm,mte-tag-storage"; > + reg = <0x00 0xfc000000 0x00 0x3e00000>; > + block-size = <0x1000>; > + memory = <&memory0>; > + }; > + > + memory1: memory@880000000 { > + device_type = "memory"; > + reg = <0x08 0x80000000 0x00 0x7c000000>; > + }; > + > + metadata1: metadata@8c0000000 { > + compatible = "arm,mte-tag-storage"; > + reg = <0x08 0xfc000000 
0x00 0x3e00000>; > + block-size = <0x1000>; > + memory = <&memory1>; > }; > Hi Alexandru, AFAIK, the above memory configuration means that there are two regions of DRAM (0x80000000-0xfc000000 and 0x8_80000000-0x8_fc000000) and this is called PDD memory map. Document [1] lists some constraints on tag memory, as below. | The following constraints apply to the tag regions in DRAM: | 1. The tag region cannot be interleaved with the data region. | The tag region must also be above the data region within DRAM. | | 2. The tag region in the physical address space cannot straddle | multiple regions of a memory map. | | PDD memory map is not allowed to have part of the tag region between | 2GB-4GB and another part between 34GB-64GB. I'm not sure if we can separate tag memory with the above configuration. Or am I missing something? [1] https://developer.arm.com/documentation/101569/0300/?lang=en (Section 5.4.6.1) Thanks, Kuan-Ying Lee > reserved-memory { > > > Alexandru Elisei (37): > mm: page_alloc: Rename gfp_to_alloc_flags_cma -> > gfp_to_alloc_flags_fast > arm64: mte: Rework naming for tag manipulation functions > arm64: mte: Rename __GFP_ZEROTAGS to __GFP_TAGGED > mm: Add MIGRATE_METADATA allocation policy > mm: Add memory statistics for the MIGRATE_METADATA allocation > policy > mm: page_alloc: Allocate from movable pcp lists only if > ALLOC_FROM_METADATA > mm: page_alloc: Bypass pcp when freeing MIGRATE_METADATA pages > mm: compaction: Account for free metadata pages in > __compact_finished() > mm: compaction: Handle metadata pages as source for direct > compaction > mm: compaction: Do not use MIGRATE_METADATA to replace pages with > metadata > mm: migrate/mempolicy: Allocate metadata-enabled destination page > mm: gup: Don't allow longterm pinning of MIGRATE_METADATA pages > arm64: mte: Reserve tag storage memory > arm64: mte: Expose tag storage pages to the MIGRATE_METADATA > freelist > arm64: mte: Make tag storage depend on ARCH_KEEP_MEMBLOCK > arm64: mte: Move 
tag storage to MIGRATE_MOVABLE when MTE is > disabled > arm64: mte: Disable dynamic tag storage management if HW KASAN is > enabled > arm64: mte: Check that tag storage blocks are in the same zone > mm: page_alloc: Manage metadata storage on page allocation > mm: compaction: Reserve metadata storage in compaction_alloc() > mm: khugepaged: Handle metadata-enabled VMAs > mm: shmem: Allocate metadata storage for in-memory filesystems > mm: Teach vma_alloc_folio() about metadata-enabled VMAs > mm: page_alloc: Teach alloc_contig_range() about MIGRATE_METADATA > arm64: mte: Manage tag storage on page allocation > arm64: mte: Perform CMOs for tag blocks on tagged page > allocation/free > arm64: mte: Reserve tag block for the zero page > mm: sched: Introduce PF_MEMALLOC_ISOLATE > mm: arm64: Define the PAGE_METADATA_NONE page protection > mm: mprotect: arm64: Set PAGE_METADATA_NONE for mprotect(PROT_MTE) > mm: arm64: Set PAGE_METADATA_NONE in set_pte_at() if missing > metadata > storage > mm: Call arch_swap_prepare_to_restore() before arch_swap_restore() > arm64: mte: swap/copypage: Handle tag restoring when missing tag > storage > arm64: mte: Handle fatal signal in reserve_metadata_storage() > mm: hugepage: Handle PAGE_METADATA_NONE faults for huge pages > KVM: arm64: Disable MTE is tag storage is enabled > arm64: mte: Enable tag storage management > > arch/arm64/Kconfig | 13 + > arch/arm64/include/asm/assembler.h | 10 + > arch/arm64/include/asm/memory_metadata.h | 49 ++ > arch/arm64/include/asm/mte-def.h | 16 +- > arch/arm64/include/asm/mte.h | 40 +- > arch/arm64/include/asm/mte_tag_storage.h | 36 ++ > arch/arm64/include/asm/page.h | 5 +- > arch/arm64/include/asm/pgtable-prot.h | 2 + > arch/arm64/include/asm/pgtable.h | 33 +- > arch/arm64/kernel/Makefile | 1 + > arch/arm64/kernel/elfcore.c | 14 +- > arch/arm64/kernel/hibernate.c | 46 +- > arch/arm64/kernel/mte.c | 31 +- > arch/arm64/kernel/mte_tag_storage.c | 667 > +++++++++++++++++++++++ > arch/arm64/kernel/setup.c | 7 + 
> arch/arm64/kvm/arm.c | 6 +- > arch/arm64/lib/mte.S | 30 +- > arch/arm64/mm/copypage.c | 26 + > arch/arm64/mm/fault.c | 35 +- > arch/arm64/mm/mteswap.c | 113 +++- > fs/proc/meminfo.c | 8 + > fs/proc/page.c | 1 + > include/asm-generic/Kbuild | 1 + > include/asm-generic/memory_metadata.h | 50 ++ > include/linux/gfp.h | 10 + > include/linux/gfp_types.h | 14 +- > include/linux/huge_mm.h | 6 + > include/linux/kernel-page-flags.h | 1 + > include/linux/migrate_mode.h | 1 + > include/linux/mm.h | 12 +- > include/linux/mmzone.h | 26 +- > include/linux/page-flags.h | 1 + > include/linux/pgtable.h | 19 + > include/linux/sched.h | 2 +- > include/linux/sched/mm.h | 13 + > include/linux/vm_event_item.h | 5 + > include/linux/vmstat.h | 2 + > include/trace/events/mmflags.h | 5 +- > mm/Kconfig | 5 + > mm/compaction.c | 52 +- > mm/huge_memory.c | 109 ++++ > mm/internal.h | 7 + > mm/khugepaged.c | 7 + > mm/memory.c | 180 +++++- > mm/mempolicy.c | 7 + > mm/migrate.c | 6 + > mm/mm_init.c | 23 +- > mm/mprotect.c | 46 ++ > mm/page_alloc.c | 136 ++++- > mm/page_isolation.c | 19 +- > mm/page_owner.c | 3 +- > mm/shmem.c | 14 +- > mm/show_mem.c | 4 + > mm/swapfile.c | 4 + > mm/vmscan.c | 3 + > mm/vmstat.c | 13 +- > 56 files changed, 1834 insertions(+), 161 deletions(-) > create mode 100644 arch/arm64/include/asm/memory_metadata.h > create mode 100644 arch/arm64/include/asm/mte_tag_storage.h > create mode 100644 arch/arm64/kernel/mte_tag_storage.c > create mode 100644 include/asm-generic/memory_metadata.h >
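For reference, the numbers in the fake FVP layout quoted above can be sanity-checked with a few lines of Python. This is only a sketch: check_region is a hypothetical helper, and the three conditions it checks (32:1 data to tag ratio, tag storage placed directly after its data region, everything inside the original 2GB bank) are read off the quoted DTS rather than taken from any binding document.

```python
BANK_SIZE = 0x80000000   # each original FVP memory bank is 2GB
RATIO = 32               # bytes of data covered by one byte of tag storage


def check_region(data_base, data_size, tag_base, tag_size):
    """Check one (memory, metadata) pair from the quoted DTS."""
    return (
        tag_size * RATIO == data_size            # 1/32 of the data region
        and tag_base == data_base + data_size    # tag storage follows data
        and tag_base + tag_size <= data_base + BANK_SIZE  # fits in the bank
    )


# (data base, data size, tag base, tag size) as given in the reg properties.
regions = [
    (0x0_80000000, 0x7C000000, 0x0_FC000000, 0x3E00000),
    (0x8_80000000, 0x7C000000, 0x8_FC000000, 0x3E00000),
]
results = [check_region(*r) for r in regions]
```

Each 2GB bank is thus cut into 33 slices of 0x3e00000 bytes (32 for data, 1 for tags), which leaves 0x200000 bytes at the top of the bank unused.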
On Mon, Sep 11, 2023 at 02:29:03PM +0200, David Hildenbrand wrote: > On 11.09.23 13:52, Catalin Marinas wrote: > > On Wed, Sep 06, 2023 at 12:23:21PM +0100, Alexandru Elisei wrote: > > > On Thu, Aug 24, 2023 at 04:24:30PM +0100, Catalin Marinas wrote: > > > > On Thu, Aug 24, 2023 at 01:25:41PM +0200, David Hildenbrand wrote: > > > > > On 24.08.23 13:06, David Hildenbrand wrote: > > > > > > Regarding one complication: "The kernel needs to know where to allocate > > > > > > a PROT_MTE page from or migrate a current page if it becomes PROT_MTE > > > > > > (mprotect()) and the range it is in does not support tagging.", > > > > > > simplified handling would be if it's in a MIGRATE_CMA pageblock, it > > > > > > doesn't support tagging. You have to migrate to a !CMA page (for > > > > > > example, not specifying GFP_MOVABLE as a quick way to achieve that). > > > > > > > > > > Okay, I now realize that this patch set effectively duplicates some CMA > > > > > behavior using a new migrate-type. > > [...] > > > I considered mixing the tag storage memory memory with normal memory and > > > adding it to MIGRATE_CMA. But since tag storage memory cannot be tagged, > > > this means that it's not enough anymore to have a __GFP_MOVABLE allocation > > > request to use MIGRATE_CMA. > > > > > > I considered two solutions to this problem: > > > > > > 1. Only allocate from MIGRATE_CMA is the requested memory is not tagged => > > > this effectively means transforming all memory from MIGRATE_CMA into the > > > MIGRATE_METADATA migratetype that the series introduces. Not very > > > appealing, because that means treating normal memory that is also on the > > > MIGRATE_CMA lists as tagged memory. > > > > That's indeed not ideal. We could try this if it makes the patches > > significantly simpler, though I'm not so sure. 
> > > > Allocating metadata is the easier part as we know the correspondence > > from the tagged pages (32 PROT_MTE page) to the metadata page (1 tag > > storage page), so alloc_contig_range() does this for us. Just adding it > > to the CMA range is sufficient. > > > > However, making sure that we don't allocate PROT_MTE pages from the > > metadata range is what led us to another migrate type. I guess we could > > achieve something similar with a new zone or a CPU-less NUMA node, > > Ideally, no significant core-mm changes to optimize for an architecture > oddity. That implies, no new zones and no new migratetypes -- unless it is > unavoidable and you are confident that you can convince core-MM people that > the use case (giving back 3% of system RAM at max in some setups) is worth > the trouble. If I was an mm maintainer, I'd also question this ;). But vendors seem pretty picky about the amount of RAM reserved for MTE (e.g. 0.5G for a 16G platform does look somewhat big). As more and more apps adopt MTE, the wastage would be smaller but the first step is getting vendors to enable it. > I also had CPU-less NUMA nodes in mind when thinking about that, but not > sure how easy it would be to integrate it. If the tag memory has actually > different performance characteristics as well, a NUMA node would be the > right choice. In general I'd expect the same characteristics. However, changing the memory designation from tag to data (and vice-versa) requires some cache maintenance. The allocation cost is slightly higher (not the runtime one), so it would help if the page allocator does not favour this range. Anyway, that's an optimisation to worry about later. > If we could find some way to easily support this either via CMA or CPU-less > NUMA nodes, that would be much preferable; even if we cannot cover each and > every future use case right now. 
I expect some issues with CXL+MTE either > way , but are happy to be taught otherwise :) I think CXL+MTE is rather theoretical at the moment. Given that PCIe doesn't have any notion of MTE, more likely there would be some piece of interconnect that generates two memory accesses: one for data and the other for tags at a configurable offset (which may or may not be in the same CXL range). > Another thought I had was adding something like CMA memory characteristics. > Like, asking if a given CMA area/page supports tagging (i.e., flag for the > CMA area set?)? I don't think adding CMA memory characteristics helps much. The metadata allocation wouldn't go through cma_alloc() but rather alloc_contig_range() directly for a specific pfn corresponding to the data pages with PROT_MTE. The core mm code doesn't need to know about the tag storage layout. It's also unlikely for cma_alloc() memory to be mapped as PROT_MTE. That's typically coming from device drivers (DMA API) with their own mmap() implementation that doesn't normally set VM_MTE_ALLOWED (and therefore PROT_MTE is rejected). What we need though is to prevent vma_alloc_folio() from allocating from a MIGRATE_CMA list if PROT_MTE (VM_MTE). I guess that's basically removing __GFP_MOVABLE in those cases. As long as we don't have large ZONE_MOVABLE areas, it shouldn't be an issue. > When you need memory that supports tagging and have a page that does not > support tagging (CMA && taggable), simply migrate to !MOVABLE memory > (eventually we could also try adding !CMA). > > Was that discussed and what would be the challenges with that? Page > migration due to compaction comes to mind, but it might also be easy to > handle if we can just avoid CMA memory for that. IIRC that was because PROT_MTE pages would have to come only from !MOVABLE ranges. Maybe that's not such a big deal. We'll give this a go and hopefully it simplifies the patches a bit (it will take a while as Alex keeps going on holiday ;)). 
In the meantime, I'm talking to the hardware people to see whether we can have MTE pages in the tag storage/metadata range. We'd still need to reserve about 0.1% of the RAM for the metadata corresponding to the tag storage range when used as data but that's negligible (1/32 of 1/32). So if some future hardware allows this, we can drop the page allocation restriction from the CMA range. > > though the latter is not guaranteed not to allocate memory from the > > range, only make it less likely. Both these options are less flexible in > > terms of size/alignment/placement. > > > > Maybe as a quick hack - only allow PROT_MTE from ZONE_NORMAL and > > configure the metadata range in ZONE_MOVABLE but at some point I'd > > expect some CXL-attached memory to support MTE with additional carveout > > reserved. > > I have no idea how we could possibly cleanly support memory hotplug in > virtual environments (virtual DIMMs, virtio-mem) with MTE. In contrast to > s390x storage keys, the approach that arm64 with MTE took here (exposing tag > memory to the VM) makes it rather hard and complicated. The current thinking is that the VM is not aware of the tag storage, that's entirely managed by the host. The host would treat the guest memory similarly to the PROT_MTE user allocations, reserve metadata etc. Thanks for the feedback so far, very useful.
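The percentages quoted in this thread (3%, 0.5G on a 16G platform, and about 0.1% for "1/32 of 1/32") follow directly from the 1:32 tag to data ratio. A small Python sketch of the arithmetic, with function names chosen only for this example:

```python
GIB = 1 << 30
MIB = 1 << 20


def tag_storage_bytes(ram_bytes):
    """Tag storage needed to cover ram_bytes of taggable memory (1/32)."""
    return ram_bytes // 32


def nested_metadata_bytes(ram_bytes):
    """Tag storage that would cover the tag storage range itself, if a
    future hardware revision allowed it to hold tagged data pages
    (1/32 of 1/32)."""
    return tag_storage_bytes(tag_storage_bytes(ram_bytes))


ram = 16 * GIB
carveout = tag_storage_bytes(ram)      # 512 MiB, i.e. 3.125% of RAM
nested = nested_metadata_bytes(ram)    # 16 MiB, i.e. roughly 0.098% of RAM
```

So on a 16G platform the permanent reservation would shrink from 512M to 16M if the tag storage range could itself hold tagged pages.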
Hi Kuan-Ying, On Wed, Sep 13, 2023 at 08:11:40AM +0000, Kuan-Ying Lee (李冠穎) wrote: > On Wed, 2023-08-23 at 14:13 +0100, Alexandru Elisei wrote: > > diff --git a/arch/arm64/boot/dts/arm/fvp-base-revc.dts > > b/arch/arm64/boot/dts/arm/fvp-base-revc.dts > > index 60472d65a355..bd050373d6cf 100644 > > --- a/arch/arm64/boot/dts/arm/fvp-base-revc.dts > > +++ b/arch/arm64/boot/dts/arm/fvp-base-revc.dts > > @@ -165,10 +165,28 @@ C1_L2: l2-cache1 { > > }; > > }; > > > > - memory@80000000 { > > + memory0: memory@80000000 { > > device_type = "memory"; > > - reg = <0x00000000 0x80000000 0 0x80000000>, > > - <0x00000008 0x80000000 0 0x80000000>; > > + reg = <0x00 0x80000000 0x00 0x7c000000>; > > + }; > > + > > + metadata0: metadata@c0000000 { > > + compatible = "arm,mte-tag-storage"; > > + reg = <0x00 0xfc000000 0x00 0x3e00000>; > > + block-size = <0x1000>; > > + memory = <&memory0>; > > + }; > > + > > + memory1: memory@880000000 { > > + device_type = "memory"; > > + reg = <0x08 0x80000000 0x00 0x7c000000>; > > + }; > > + > > + metadata1: metadata@8c0000000 { > > + compatible = "arm,mte-tag-storage"; > > + reg = <0x08 0xfc000000 0x00 0x3e00000>; > > + block-size = <0x1000>; > > + memory = <&memory1>; > > }; > > > > AFAIK, the above memory configuration means that there are two region > of dram(0x80000000-0xfc000000 and 0x8_80000000-0x8_fc0000000) and this > is called PDD memory map. > > Document[1] said there are some constraints of tag memory as below. > > | The following constraints apply to the tag regions in DRAM: > | 1. The tag region cannot be interleaved with the data region. > | The tag region must also be above the data region within DRAM. > | > | 2.The tag region in the physical address space cannot straddle > | multiple regions of a memory map. > | > | PDD memory map is not allowed to have part of the tag region between > | 2GB-4GB and another part between 34GB-64GB. > > I'm not sure if we can separate tag memory with the above > configuration. 
Or do I miss something? > > [1] https://developer.arm.com/documentation/101569/0300/?lang=en > (Section 5.4.6.1) Good point, thanks. The above dts is some random layout we picked as an example; it doesn't match any real hardware and we didn't pay attention to the interconnect limitations (we fake the tag storage on the model). I'll try to dig out how the mtu_tag_addr_shutter registers work and how the sparse DRAM space is compressed to a smaller tag range. But that's something done by firmware and the kernel only learns the tag storage location from the DT (provided by firmware). We also don't need to know the fine-grained mapping between 32 bytes of data and 1 byte (2 tags) in the tag storage, only the block size in the tag storage space that covers all interleaving done by the interconnect (it can be from 1 byte to something larger like a page; the kernel will then use the least common multiple of the page size and this tag block size to figure out how many pages to reserve).
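The block size rule described above can be sketched as follows. This is an illustrative Python model, not the code from the series; it assumes 4K pages, and the function names are invented for the example.

```python
from math import gcd

PAGE_SIZE = 4096
DATA_PAGES_PER_TAG_PAGE = 32   # 1 byte of tag storage covers 32 bytes of data


def tag_pages_per_reservation(block_size):
    """Tag storage pages that have to be reserved together.

    The reservation unit is the least common multiple of the page size
    and the tag block size, expressed in pages.
    """
    lcm = PAGE_SIZE * block_size // gcd(PAGE_SIZE, block_size)
    return lcm // PAGE_SIZE


def data_pages_covered(block_size):
    """Data pages whose tags live in one reservation unit."""
    return tag_pages_per_reservation(block_size) * DATA_PAGES_PER_TAG_PAGE
```

With the block-size of 0x1000 used in the example DTS the unit is a single tag storage page covering 32 data pages; a hypothetical 6K block would force 3 tag storage pages (96 data pages) to be reserved together.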
On Wed, Sep 13, 2023 at 04:29:25PM +0100, Catalin Marinas wrote: > On Mon, Sep 11, 2023 at 02:29:03PM +0200, David Hildenbrand wrote: > > On 11.09.23 13:52, Catalin Marinas wrote: > > > On Wed, Sep 06, 2023 at 12:23:21PM +0100, Alexandru Elisei wrote: > > > > On Thu, Aug 24, 2023 at 04:24:30PM +0100, Catalin Marinas wrote: > > > > > On Thu, Aug 24, 2023 at 01:25:41PM +0200, David Hildenbrand wrote: > > > > > > On 24.08.23 13:06, David Hildenbrand wrote: > > > > > > > Regarding one complication: "The kernel needs to know where to allocate > > > > > > > a PROT_MTE page from or migrate a current page if it becomes PROT_MTE > > > > > > > (mprotect()) and the range it is in does not support tagging.", > > > > > > > simplified handling would be if it's in a MIGRATE_CMA pageblock, it > > > > > > > doesn't support tagging. You have to migrate to a !CMA page (for > > > > > > > example, not specifying GFP_MOVABLE as a quick way to achieve that). > > > > > > > > > > > > Okay, I now realize that this patch set effectively duplicates some CMA > > > > > > behavior using a new migrate-type. > > > [...] > > > > I considered mixing the tag storage memory memory with normal memory and > > > > adding it to MIGRATE_CMA. But since tag storage memory cannot be tagged, > > > > this means that it's not enough anymore to have a __GFP_MOVABLE allocation > > > > request to use MIGRATE_CMA. > > > > > > > > I considered two solutions to this problem: > > > > > > > > 1. Only allocate from MIGRATE_CMA is the requested memory is not tagged => > > > > this effectively means transforming all memory from MIGRATE_CMA into the > > > > MIGRATE_METADATA migratetype that the series introduces. Not very > > > > appealing, because that means treating normal memory that is also on the > > > > MIGRATE_CMA lists as tagged memory. > > > > > > That's indeed not ideal. We could try this if it makes the patches > > > significantly simpler, though I'm not so sure. 
> > > > > > Allocating metadata is the easier part as we know the correspondence > > > from the tagged pages (32 PROT_MTE page) to the metadata page (1 tag > > > storage page), so alloc_contig_range() does this for us. Just adding it > > > to the CMA range is sufficient. > > > > > > However, making sure that we don't allocate PROT_MTE pages from the > > > metadata range is what led us to another migrate type. I guess we could > > > achieve something similar with a new zone or a CPU-less NUMA node, > > > > Ideally, no significant core-mm changes to optimize for an architecture > > oddity. That implies, no new zones and no new migratetypes -- unless it is > > unavoidable and you are confident that you can convince core-MM people that > > the use case (giving back 3% of system RAM at max in some setups) is worth > > the trouble. > > If I was an mm maintainer, I'd also question this ;). But vendors seem > pretty picky about the amount of RAM reserved for MTE (e.g. 0.5G for a > 16G platform does look somewhat big). As more and more apps adopt MTE, > the wastage would be smaller but the first step is getting vendors to > enable it. > > > I also had CPU-less NUMA nodes in mind when thinking about that, but not > > sure how easy it would be to integrate it. If the tag memory has actually > > different performance characteristics as well, a NUMA node would be the > > right choice. > > In general I'd expect the same characteristics. However, changing the > memory designation from tag to data (and vice-versa) requires some cache > maintenance. The allocation cost is slightly higher (not the runtime > one), so it would help if the page allocator does not favour this range. > Anyway, that's an optimisation to worry about later. > > > If we could find some way to easily support this either via CMA or CPU-less > > NUMA nodes, that would be much preferable; even if we cannot cover each and > > every future use case right now. 
I expect some issues with CXL+MTE either > > way , but are happy to be taught otherwise :) > > I think CXL+MTE is rather theoretical at the moment. Given that PCIe > doesn't have any notion of MTE, more likely there would be some piece of > interconnect that generates two memory accesses: one for data and the > other for tags at a configurable offset (which may or may not be in the > same CXL range). > > > Another thought I had was adding something like CMA memory characteristics. > > Like, asking if a given CMA area/page supports tagging (i.e., flag for the > > CMA area set?)? > > I don't think adding CMA memory characteristics helps much. The metadata > allocation wouldn't go through cma_alloc() but rather > alloc_contig_range() directly for a specific pfn corresponding to the > data pages with PROT_MTE. The core mm code doesn't need to know about > the tag storage layout. > > It's also unlikely for cma_alloc() memory to be mapped as PROT_MTE. > That's typically coming from device drivers (DMA API) with their own > mmap() implementation that doesn't normally set VM_MTE_ALLOWED (and > therefore PROT_MTE is rejected). > > What we need though is to prevent vma_alloc_folio() from allocating from > a MIGRATE_CMA list if PROT_MTE (VM_MTE). I guess that's basically > removing __GFP_MOVABLE in those cases. As long as we don't have large > ZONE_MOVABLE areas, it shouldn't be an issue. > How about not setting ALLOC_CMA if __GFP_TAGGED is set? Removing __GFP_MOVABLE may cause movable pages to be allocated from an unmovable migratetype, which may not be desirable for page fragmentation. > > When you need memory that supports tagging and have a page that does not > > support tagging (CMA && taggable), simply migrate to !MOVABLE memory > > (eventually we could also try adding !CMA). > > > > Was that discussed and what would be the challenges with that? Page > > migration due to compaction comes to mind, but it might also be easy to > > handle if we can just avoid CMA memory for that.
> > IIRC that was because PROT_MTE pages would have to come only from > !MOVABLE ranges. Maybe that's not such big deal. > Could you explain what it means that PROT_MTE pages have to come only from !MOVABLE ranges? I don't understand this part very well. Thanks, Hyesoo. > We'll give this a go and hopefully it simplifies the patches a bit (it > will take a while as Alex keeps going on holiday ;)). In the meantime, > I'm talking to the hardware people to see whether we can have MTE pages > in the tag storage/metadata range. We'd still need to reserve about 0.1% > of the RAM for the metadata corresponding to the tag storage range when > used as data but that's negligible (1/32 of 1/32). So if some future > hardware allows this, we can drop the page allocation restriction from > the CMA range. > > > > though the latter is not guaranteed not to allocate memory from the > > > range, only make it less likely. Both these options are less flexible in > > > terms of size/alignment/placement. > > > > > > Maybe as a quick hack - only allow PROT_MTE from ZONE_NORMAL and > > > configure the metadata range in ZONE_MOVABLE but at some point I'd > > > expect some CXL-attached memory to support MTE with additional carveout > > > reserved. > > > > I have no idea how we could possibly cleanly support memory hotplug in > > virtual environments (virtual DIMMs, virtio-mem) with MTE. In contrast to > > s390x storage keys, the approach that arm64 with MTE took here (exposing tag > > memory to the VM) makes it rather hard and complicated. > > The current thinking is that the VM is not aware of the tag storage, > that's entirely managed by the host. The host would treat the guest > memory similarly to the PROT_MTE user allocations, reserve metadata etc. > > Thanks for the feedback so far, very useful. > > -- > Catalin >
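The sizes quoted in this thread (0.5G for a 16G platform, about 3% of RAM, and about 0.1% for "1/32 of 1/32") all follow from the 32:1 ratio between data and tag storage. The short Python sketch below only reproduces that arithmetic; it assumes "16G" means 16 GiB, and the names are chosen for this illustration.

```python
GIB = 1 << 30

# One byte of tag storage holds two 4-bit tags and each tag covers a
# 16-byte granule, so one tag byte describes 32 bytes of data.
DATA_BYTES_PER_TAG_BYTE = 32


def tag_storage_bytes(data_bytes: int) -> int:
    """Tag storage needed to tag data_bytes of memory."""
    return data_bytes // DATA_BYTES_PER_TAG_BYTE


ram = 16 * GIB                                   # assumed platform size
carveout = tag_storage_bytes(ram)                # tag storage for all of RAM
tags_for_carveout = tag_storage_bytes(carveout)  # "1/32 of 1/32"

print(f"tag storage: {carveout / GIB} GiB ({100 * carveout / ram:.3f}% of RAM)")
print(f"tags for the tag storage itself: {100 * tags_for_carveout / ram:.3f}% of RAM")
```

The first figure is the memory the series tries to hand back to the allocator; the second is what would still have to stay reserved if hardware allowed the tag storage range itself to hold tagged pages.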
Hi, On Wed, Oct 25, 2023 at 11:59:32AM +0900, Hyesoo Yu wrote: > On Wed, Sep 13, 2023 at 04:29:25PM +0100, Catalin Marinas wrote: > > On Mon, Sep 11, 2023 at 02:29:03PM +0200, David Hildenbrand wrote: > > > On 11.09.23 13:52, Catalin Marinas wrote: > > > > On Wed, Sep 06, 2023 at 12:23:21PM +0100, Alexandru Elisei wrote: > > > > > On Thu, Aug 24, 2023 at 04:24:30PM +0100, Catalin Marinas wrote: > > > > > > On Thu, Aug 24, 2023 at 01:25:41PM +0200, David Hildenbrand wrote: > > > > > > > On 24.08.23 13:06, David Hildenbrand wrote: > > > > > > > > Regarding one complication: "The kernel needs to know where to allocate > > > > > > > > a PROT_MTE page from or migrate a current page if it becomes PROT_MTE > > > > > > > > (mprotect()) and the range it is in does not support tagging.", > > > > > > > > simplified handling would be if it's in a MIGRATE_CMA pageblock, it > > > > > > > > doesn't support tagging. You have to migrate to a !CMA page (for > > > > > > > > example, not specifying GFP_MOVABLE as a quick way to achieve that). > > > > > > > > > > > > > > Okay, I now realize that this patch set effectively duplicates some CMA > > > > > > > behavior using a new migrate-type. > > > > [...] > > > > > I considered mixing the tag storage memory memory with normal memory and > > > > > adding it to MIGRATE_CMA. But since tag storage memory cannot be tagged, > > > > > this means that it's not enough anymore to have a __GFP_MOVABLE allocation > > > > > request to use MIGRATE_CMA. > > > > > > > > > > I considered two solutions to this problem: > > > > > > > > > > 1. Only allocate from MIGRATE_CMA is the requested memory is not tagged => > > > > > this effectively means transforming all memory from MIGRATE_CMA into the > > > > > MIGRATE_METADATA migratetype that the series introduces. Not very > > > > > appealing, because that means treating normal memory that is also on the > > > > > MIGRATE_CMA lists as tagged memory. > > > > > > > > That's indeed not ideal. 
We could try this if it makes the patches > > > > significantly simpler, though I'm not so sure. > > > > > > > > Allocating metadata is the easier part as we know the correspondence > > > > from the tagged pages (32 PROT_MTE page) to the metadata page (1 tag > > > > storage page), so alloc_contig_range() does this for us. Just adding it > > > > to the CMA range is sufficient. > > > > > > > > However, making sure that we don't allocate PROT_MTE pages from the > > > > metadata range is what led us to another migrate type. I guess we could > > > > achieve something similar with a new zone or a CPU-less NUMA node, > > > > > > Ideally, no significant core-mm changes to optimize for an architecture > > > oddity. That implies, no new zones and no new migratetypes -- unless it is > > > unavoidable and you are confident that you can convince core-MM people that > > > the use case (giving back 3% of system RAM at max in some setups) is worth > > > the trouble. > > > > If I was an mm maintainer, I'd also question this ;). But vendors seem > > pretty picky about the amount of RAM reserved for MTE (e.g. 0.5G for a > > 16G platform does look somewhat big). As more and more apps adopt MTE, > > the wastage would be smaller but the first step is getting vendors to > > enable it. > > > > > I also had CPU-less NUMA nodes in mind when thinking about that, but not > > > sure how easy it would be to integrate it. If the tag memory has actually > > > different performance characteristics as well, a NUMA node would be the > > > right choice. > > > > In general I'd expect the same characteristics. However, changing the > > memory designation from tag to data (and vice-versa) requires some cache > > maintenance. The allocation cost is slightly higher (not the runtime > > one), so it would help if the page allocator does not favour this range. > > Anyway, that's an optimisation to worry about later. 
> > > > > If we could find some way to easily support this either via CMA or CPU-less > > > NUMA nodes, that would be much preferable; even if we cannot cover each and > > > every future use case right now. I expect some issues with CXL+MTE either > > > way , but are happy to be taught otherwise :) > > > > I think CXL+MTE is rather theoretical at the moment. Given that PCIe > > doesn't have any notion of MTE, more likely there would be some piece of > > interconnect that generates two memory accesses: one for data and the > > other for tags at a configurable offset (which may or may not be in the > > same CXL range). > > > > > Another thought I had was adding something like CMA memory characteristics. > > > Like, asking if a given CMA area/page supports tagging (i.e., flag for the > > > CMA area set?)? > > > > I don't think adding CMA memory characteristics helps much. The metadata > > allocation wouldn't go through cma_alloc() but rather > > alloc_contig_range() directly for a specific pfn corresponding to the > > data pages with PROT_MTE. The core mm code doesn't need to know about > > the tag storage layout. > > > > It's also unlikely for cma_alloc() memory to be mapped as PROT_MTE. > > That's typically coming from device drivers (DMA API) with their own > > mmap() implementation that doesn't normally set VM_MTE_ALLOWED (and > > therefore PROT_MTE is rejected). > > > > What we need though is to prevent vma_alloc_folio() from allocating from > > a MIGRATE_CMA list if PROT_MTE (VM_MTE). I guess that's basically > > removing __GFP_MOVABLE in those cases. As long as we don't have large > > ZONE_MOVABLE areas, it shouldn't be an issue. > > > > How about unsetting ALLOC_CMA if GFP_TAGGED ? > Removing __GFP_MOVABLE may cause movable pages to be allocated in un > unmovable migratetype, which may not be desirable for page fragmentation. Yes, not setting ALLOC_CMA in alloc_flags if __GFP_TAGGED is what I am intending to do. 
> > > > When you need memory that supports tagging and have a page that does not > > > support tagging (CMA && taggable), simply migrate to !MOVABLE memory > > > (eventually we could also try adding !CMA). > > > > > > Was that discussed and what would be the challenges with that? Page > > > migration due to compaction comes to mind, but it might also be easy to > > > handle if we can just avoid CMA memory for that. > > > > IIRC that was because PROT_MTE pages would have to come only from > > !MOVABLE ranges. Maybe that's not such big deal. > > > > Could you explain what it means that PROT_MTE have to come only from > !MOVABLE range ? I don't understand this part very well. I believe that was with the old approach, where tag storage cannot be tagged. I'm guessing that the idea was that during migration of a tagged page, to make sure that the destination page is not a tag storage page (which cannot be tagged), the gfp flags used for allocating the destination page would be set without __GFP_MOVABLE, which ensures that the destination page is not allocated from MIGRATE_CMA. But that is not needed anymore, if we don't set ALLOC_CMA if __GFP_TAGGED. Thanks, Alex > > Thanks, > Hyesoo. > > > We'll give this a go and hopefully it simplifies the patches a bit (it > > will take a while as Alex keeps going on holiday ;)). In the meantime, > > I'm talking to the hardware people to see whether we can have MTE pages > > in the tag storage/metadata range. We'd still need to reserve about 0.1% > > of the RAM for the metadata corresponding to the tag storage range when > > used as data but that's negligible (1/32 of 1/32). So if some future > > hardware allows this, we can drop the page allocation restriction from > > the CMA range. > > > > > > though the latter is not guaranteed not to allocate memory from the > > > > range, only make it less likely. Both these options are less flexible in > > > > terms of size/alignment/placement. 
> > > > > > > > Maybe as a quick hack - only allow PROT_MTE from ZONE_NORMAL and > > > > configure the metadata range in ZONE_MOVABLE but at some point I'd > > > > expect some CXL-attached memory to support MTE with additional carveout > > > > reserved. > > > > > > I have no idea how we could possibly cleanly support memory hotplug in > > > virtual environments (virtual DIMMs, virtio-mem) with MTE. In contrast to > > > s390x storage keys, the approach that arm64 with MTE took here (exposing tag > > > memory to the VM) makes it rather hard and complicated. > > > > The current thinking is that the VM is not aware of the tag storage, > > that's entirely managed by the host. The host would treat the guest > > memory similarly to the PROT_MTE user allocations, reserve metadata etc. > > > > Thanks for the feedback so far, very useful. > > > > -- > > Catalin > >
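The rule agreed on in this exchange can be modelled in a few lines. The sketch below is Python, not kernel code: the bit values are arbitrary, `__GFP_TAGGED` is the flag proposed by the series, and `alloc_flags_for` is a simplified, hypothetical stand-in for the way the page allocator derives ALLOC_CMA from the gfp mask.

```python
# Arbitrary bit values picked for this model; they do not match the kernel's.
GFP_MOVABLE = 1 << 0  # stands in for __GFP_MOVABLE
GFP_TAGGED = 1 << 1   # stands in for the proposed __GFP_TAGGED
ALLOC_CMA = 1 << 0    # stands in for the allocator-internal ALLOC_CMA


def alloc_flags_for(gfp_mask: int) -> int:
    """Return the modelled alloc_flags for a gfp mask.

    Movable allocations may use MIGRATE_CMA pageblocks, unless the
    request is for tagged memory, because tag storage cannot be tagged.
    """
    alloc_flags = 0
    movable = bool(gfp_mask & GFP_MOVABLE)
    tagged = bool(gfp_mask & GFP_TAGGED)
    if movable and not tagged:
        alloc_flags |= ALLOC_CMA
    return alloc_flags


for name, mask in (("movable", GFP_MOVABLE),
                   ("movable|tagged", GFP_MOVABLE | GFP_TAGGED),
                   ("tagged", GFP_TAGGED)):
    allowed = bool(alloc_flags_for(mask) & ALLOC_CMA)
    print(f"{name:15s} may use CMA pageblocks: {allowed}")
```

In this model the request keeps its movable bit, so a tagged page is still placed on a movable migratetype and only loses access to the CMA pageblocks, which is the fragmentation concern raised above against dropping __GFP_MOVABLE.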
On Wed, Oct 25, 2023 at 09:47:36AM +0100, Alexandru Elisei wrote: > Hi, > > On Wed, Oct 25, 2023 at 11:59:32AM +0900, Hyesoo Yu wrote: > > On Wed, Sep 13, 2023 at 04:29:25PM +0100, Catalin Marinas wrote: > > > On Mon, Sep 11, 2023 at 02:29:03PM +0200, David Hildenbrand wrote: > > > > On 11.09.23 13:52, Catalin Marinas wrote: > > > > > On Wed, Sep 06, 2023 at 12:23:21PM +0100, Alexandru Elisei wrote: > > > > > > On Thu, Aug 24, 2023 at 04:24:30PM +0100, Catalin Marinas wrote: > > > > > > > On Thu, Aug 24, 2023 at 01:25:41PM +0200, David Hildenbrand wrote: > > > > > > > > On 24.08.23 13:06, David Hildenbrand wrote: > > > > > > > > > Regarding one complication: "The kernel needs to know where to allocate > > > > > > > > > a PROT_MTE page from or migrate a current page if it becomes PROT_MTE > > > > > > > > > (mprotect()) and the range it is in does not support tagging.", > > > > > > > > > simplified handling would be if it's in a MIGRATE_CMA pageblock, it > > > > > > > > > doesn't support tagging. You have to migrate to a !CMA page (for > > > > > > > > > example, not specifying GFP_MOVABLE as a quick way to achieve that). > > > > > > > > > > > > > > > > Okay, I now realize that this patch set effectively duplicates some CMA > > > > > > > > behavior using a new migrate-type. > > > > > [...] > > > > > > I considered mixing the tag storage memory memory with normal memory and > > > > > > adding it to MIGRATE_CMA. But since tag storage memory cannot be tagged, > > > > > > this means that it's not enough anymore to have a __GFP_MOVABLE allocation > > > > > > request to use MIGRATE_CMA. > > > > > > > > > > > > I considered two solutions to this problem: > > > > > > > > > > > > 1. Only allocate from MIGRATE_CMA is the requested memory is not tagged => > > > > > > this effectively means transforming all memory from MIGRATE_CMA into the > > > > > > MIGRATE_METADATA migratetype that the series introduces. 
Not very > > > > > > appealing, because that means treating normal memory that is also on the > > > > > > MIGRATE_CMA lists as tagged memory. > > > > > > > > > > That's indeed not ideal. We could try this if it makes the patches > > > > > significantly simpler, though I'm not so sure. > > > > > > > > > > Allocating metadata is the easier part as we know the correspondence > > > > > from the tagged pages (32 PROT_MTE page) to the metadata page (1 tag > > > > > storage page), so alloc_contig_range() does this for us. Just adding it > > > > > to the CMA range is sufficient. > > > > > > > > > > However, making sure that we don't allocate PROT_MTE pages from the > > > > > metadata range is what led us to another migrate type. I guess we could > > > > > achieve something similar with a new zone or a CPU-less NUMA node, > > > > > > > > Ideally, no significant core-mm changes to optimize for an architecture > > > > oddity. That implies, no new zones and no new migratetypes -- unless it is > > > > unavoidable and you are confident that you can convince core-MM people that > > > > the use case (giving back 3% of system RAM at max in some setups) is worth > > > > the trouble. > > > > > > If I was an mm maintainer, I'd also question this ;). But vendors seem > > > pretty picky about the amount of RAM reserved for MTE (e.g. 0.5G for a > > > 16G platform does look somewhat big). As more and more apps adopt MTE, > > > the wastage would be smaller but the first step is getting vendors to > > > enable it. > > > > > > > I also had CPU-less NUMA nodes in mind when thinking about that, but not > > > > sure how easy it would be to integrate it. If the tag memory has actually > > > > different performance characteristics as well, a NUMA node would be the > > > > right choice. > > > > > > In general I'd expect the same characteristics. However, changing the > > > memory designation from tag to data (and vice-versa) requires some cache > > > maintenance. 
The allocation cost is slightly higher (not the runtime > > > one), so it would help if the page allocator does not favour this range. > > > Anyway, that's an optimisation to worry about later. > > > > > > > If we could find some way to easily support this either via CMA or CPU-less > > > > NUMA nodes, that would be much preferable; even if we cannot cover each and > > > > every future use case right now. I expect some issues with CXL+MTE either > > > > way , but are happy to be taught otherwise :) > > > > > > I think CXL+MTE is rather theoretical at the moment. Given that PCIe > > > doesn't have any notion of MTE, more likely there would be some piece of > > > interconnect that generates two memory accesses: one for data and the > > > other for tags at a configurable offset (which may or may not be in the > > > same CXL range). > > > > > > > Another thought I had was adding something like CMA memory characteristics. > > > > Like, asking if a given CMA area/page supports tagging (i.e., flag for the > > > > CMA area set?)? > > > > > > I don't think adding CMA memory characteristics helps much. The metadata > > > allocation wouldn't go through cma_alloc() but rather > > > alloc_contig_range() directly for a specific pfn corresponding to the > > > data pages with PROT_MTE. The core mm code doesn't need to know about > > > the tag storage layout. > > > > > > It's also unlikely for cma_alloc() memory to be mapped as PROT_MTE. > > > That's typically coming from device drivers (DMA API) with their own > > > mmap() implementation that doesn't normally set VM_MTE_ALLOWED (and > > > therefore PROT_MTE is rejected). > > > > > > What we need though is to prevent vma_alloc_folio() from allocating from > > > a MIGRATE_CMA list if PROT_MTE (VM_MTE). I guess that's basically > > > removing __GFP_MOVABLE in those cases. As long as we don't have large > > > ZONE_MOVABLE areas, it shouldn't be an issue. > > > > > > > How about unsetting ALLOC_CMA if GFP_TAGGED ? 
> > Removing __GFP_MOVABLE may cause movable pages to be allocated in un > > unmovable migratetype, which may not be desirable for page fragmentation. > > Yes, not setting ALLOC_CMA in alloc_flags if __GFP_TAGGED is what I am > intending to do. > > > > > > > When you need memory that supports tagging and have a page that does not > > > > support tagging (CMA && taggable), simply migrate to !MOVABLE memory > > > > (eventually we could also try adding !CMA). > > > > > > > > Was that discussed and what would be the challenges with that? Page > > > > migration due to compaction comes to mind, but it might also be easy to > > > > handle if we can just avoid CMA memory for that. > > > > > > IIRC that was because PROT_MTE pages would have to come only from > > > !MOVABLE ranges. Maybe that's not such big deal. > > > > > > > Could you explain what it means that PROT_MTE have to come only from > > !MOVABLE range ? I don't understand this part very well. > > I believe that was with the old approach, where tag storage cannot be tagged. > > I'm guessing that the idea was that during migration of a tagged page, to make > sure that the destination page is not a tag storage page (which cannot be > tagged), the gfp flags used for allocating the destination page would be set > without __GFP_MOVABLE, which ensures that the destination page is not > allocated from MIGRATE_CMA. But that is not needed anymore, if we don't set > ALLOC_CMA if __GFP_TAGGED. > > Thanks, > Alex > Hello, Alex. If we only avoid using ALLOC_CMA for __GFP_TAGGED, would we still be able to use the next iteration even if the hardware does not support "tag of tag"? I am not sure every vendor will support tag of tag, since there is no information about that feature in documents such as the Google spec document. We are also looking into this. Thanks, Regards. > > > > Thanks, > > Hyesoo.
> > > > > We'll give this a go and hopefully it simplifies the patches a bit (it > > > will take a while as Alex keeps going on holiday ;)). In the meantime, > > > I'm talking to the hardware people to see whether we can have MTE pages > > > in the tag storage/metadata range. We'd still need to reserve about 0.1% > > > of the RAM for the metadata corresponding to the tag storage range when > > > used as data but that's negligible (1/32 of 1/32). So if some future > > > hardware allows this, we can drop the page allocation restriction from > > > the CMA range. > > > > > > > > though the latter is not guaranteed not to allocate memory from the > > > > > range, only make it less likely. Both these options are less flexible in > > > > > terms of size/alignment/placement. > > > > > > > > > > Maybe as a quick hack - only allow PROT_MTE from ZONE_NORMAL and > > > > > configure the metadata range in ZONE_MOVABLE but at some point I'd > > > > > expect some CXL-attached memory to support MTE with additional carveout > > > > > reserved. > > > > > > > > I have no idea how we could possibly cleanly support memory hotplug in > > > > virtual environments (virtual DIMMs, virtio-mem) with MTE. In contrast to > > > > s390x storage keys, the approach that arm64 with MTE took here (exposing tag > > > > memory to the VM) makes it rather hard and complicated. > > > > > > The current thinking is that the VM is not aware of the tag storage, > > > that's entirely managed by the host. The host would treat the guest > > > memory similarly to the PROT_MTE user allocations, reserve metadata etc. > > > > > > Thanks for the feedback so far, very useful. > > > > > > -- > > > Catalin > > > > > >
On Wed, Oct 25, 2023 at 05:52:58PM +0900, Hyesoo Yu wrote: > If we only avoid using ALLOC_CMA for __GFP_TAGGED, would we still be able to use > the next iteration even if the hardware does not support "tag of tag" ? It depends on what the next iteration looks like. The plan was not to support this, so that we avoid another complication where a non-tagged page is mprotect'ed to become tagged and it would need to be migrated out of the CMA range. Not sure how much code it would save. > I am not sure every vendor will support tag of tag, since there is no information > related to that feature, like in the Google spec document. If you are aware of any vendors not supporting this, please direct them to the Arm support team; it would be very useful information for us. Thanks.
diff --git a/arch/arm64/boot/dts/arm/fvp-base-revc.dts b/arch/arm64/boot/dts/arm/fvp-base-revc.dts
index 60472d65a355..bd050373d6cf 100644
--- a/arch/arm64/boot/dts/arm/fvp-base-revc.dts
+++ b/arch/arm64/boot/dts/arm/fvp-base-revc.dts
@@ -165,10 +165,28 @@ C1_L2: l2-cache1 {
 		};
 	};
 
-	memory@80000000 {
+	memory0: memory@80000000 {
 		device_type = "memory";
-		reg = <0x00000000 0x80000000 0 0x80000000>,
-		      <0x00000008 0x80000000 0 0x80000000>;
+		reg = <0x00 0x80000000 0x00 0x7c000000>;
+	};
+
+	metadata0: metadata@c0000000 {
+		compatible = "arm,mte-tag-storage";
+		reg = <0x00 0xfc000000 0x00 0x3e00000>;
+		block-size = <0x1000>;
+		memory = <&memory0>;
+	};
+
+	memory1: memory@880000000 {
+		device_type = "memory";
+		reg = <0x08 0x80000000 0x00 0x7c000000>;
+	};
+
+	metadata1: metadata@8c0000000 {
+		compatible = "arm,mte-tag-storage";
+		reg = <0x08 0xfc000000 0x00 0x3e00000>;
+		block-size = <0x1000>;
+		memory = <&memory1>;
 	};
 
 	reserved-memory {
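As a sanity check on the example layout in this diff, the Python sketch below copies the `reg` values and `block-size` from the nodes and verifies that each tag storage region is 1/32 of the memory it describes and starts right where that memory ends. The `tag_block_for` helper additionally assumes a simple linear mapping from data address to tag storage block. That mapping is an assumption made for this illustration only: as noted earlier in the thread, the real fine-grained mapping is set up by firmware and the interconnect, and the kernel does not need to know it.

```python
DATA_BYTES_PER_TAG_BYTE = 32
BLOCK_SIZE = 0x1000  # block-size property from the diff

# (base, size) pairs copied from the reg properties in the diff.
memory0, metadata0 = (0x0_8000_0000, 0x7C00_0000), (0x0_FC00_0000, 0x03E0_0000)
memory1, metadata1 = (0x8_8000_0000, 0x7C00_0000), (0x8_FC00_0000, 0x03E0_0000)


def covers(memory, metadata) -> bool:
    """True if the tag storage is exactly 1/32 of the memory size."""
    return metadata[1] * DATA_BYTES_PER_TAG_BYTE == memory[1]


def adjacent(memory, metadata) -> bool:
    """True if the tag storage starts where the memory region ends."""
    return memory[0] + memory[1] == metadata[0]


def tag_block_for(addr, memory, metadata, block_size=BLOCK_SIZE) -> int:
    """Base of the tag storage block for addr, ASSUMING a linear layout."""
    base, size = memory
    if not base <= addr < base + size:
        raise ValueError("address outside the memory region")
    tag_offset = (addr - base) // DATA_BYTES_PER_TAG_BYTE
    return metadata[0] + (tag_offset // block_size) * block_size


for mem, meta in ((memory0, metadata0), (memory1, metadata1)):
    print(hex(mem[0]), "covered:", covers(mem, meta), "adjacent:", adjacent(mem, meta))
print(hex(tag_block_for(0x8002_0000, memory0, metadata0)))
```

With a 4 KiB block size, one tag storage block corresponds to 32 data pages (128 KiB) under this assumed layout, which matches the "32 PROT_MTE pages to 1 tag storage page" correspondence mentioned in the discussion.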