diff mbox series

tdx, memory hotplug: Check whole hot-adding memory range for TDX

Message ID 20240930055112.344206-1-ying.huang@intel.com (mailing list archive)
State New
Headers show
Series tdx, memory hotplug: Check whole hot-adding memory range for TDX | expand

Commit Message

Huang, Ying Sept. 30, 2024, 5:51 a.m. UTC
On systems with TDX (Trust Domain eXtensions) enabled, memory ranges
hot-added must be checked for compatibility by TDX.  This is currently
implemented through memory hotplug notifiers for each memory_block.
If a memory range which isn't TDX compatible is hot-added, for
example, some CXL memory, the command line as follows,

  $ echo 1 > /sys/devices/system/node/nodeX/memoryY/online

will report something like,

  bash: echo: write error: Operation not permitted

If pr_debug() is enabled, the error message like below will be shown
in the kernel log,

  online_pages [mem 0xXXXXXXXXXX-0xXXXXXXXXXX] failed

Both are too general to root cause the problem.  This will confuse
users.  One solution is to print some error messages in the TDX memory
hotplug notifier.  However, memory hotplug notifiers are called for
each memory block, so this may lead to a large volume of messages in
the kernel log if a large number of memory blocks are onlined with a
script or automatically.  For example, the typical size of memory
block is 128MB on x86_64, when online 64GB CXL memory, 512 messages
will be logged.

Therefore, in this patch, the whole hot-adding memory range is checked
for TDX compatibility through a newly added architecture specific
function (arch_check_hotplug_memory_range()).  If rejected, the memory
hot-adding will be aborted with a proper kernel log message.  Which
looks like something as below,

  virt/tdx: Reject hot-adding memory range: 0xXXXXXXXX-0xXXXXXXXX for TDX compatibility.

The target use case is to support CXL memory on TDX enabled systems.
If the CXL memory isn't compatible with TDX, the whole CXL memory
range hot-adding will be rejected.  While the CXL memory can still be
used via devdax interface.

This also makes the original TDX memory hotplug notifier useless, so
delete it.

Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Suggested-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Acked-by: Kai Huang <kai.huang@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Oscar Salvador <osalvador@suse.de>
---
 arch/x86/include/asm/tdx.h     |  2 ++
 arch/x86/mm/init_64.c          |  6 ++++++
 arch/x86/virt/vmx/tdx/tdx.c    | 35 ++++++++++++----------------------
 include/linux/memory_hotplug.h |  3 +++
 mm/memory_hotplug.c            |  7 ++++++-
 5 files changed, 29 insertions(+), 24 deletions(-)

Comments

David Hildenbrand Sept. 30, 2024, 8:58 a.m. UTC | #1
On 30.09.24 07:51, Huang Ying wrote:
> On systems with TDX (Trust Domain eXtensions) enabled, memory ranges
> hot-added must be checked for compatibility by TDX.  This is currently
> implemented through memory hotplug notifiers for each memory_block.
> If a memory range which isn't TDX compatible is hot-added, for
> example, some CXL memory, the command line as follows,
> 
>    $ echo 1 > /sys/devices/system/node/nodeX/memoryY/online
> 
> will report something like,
> 
>    bash: echo: write error: Operation not permitted
> 
> If pr_debug() is enabled, the error message like below will be shown
> in the kernel log,
> 
>    online_pages [mem 0xXXXXXXXXXX-0xXXXXXXXXXX] failed
> 
> Both are too general to root cause the problem.  This will confuse
> users.  One solution is to print some error messages in the TDX memory
> hotplug notifier.  However, memory hotplug notifiers are called for
> each memory block, so this may lead to a large volume of messages in
> the kernel log if a large number of memory blocks are onlined with a
> script or automatically.  For example, the typical size of memory
> block is 128MB on x86_64, when online 64GB CXL memory, 512 messages
> will be logged.

ratelimiting would likely help here a lot, but I agree that it is 
suboptimal.

> 
> Therefore, in this patch, the whole hot-adding memory range is checked
> for TDX compatibility through a newly added architecture specific
> function (arch_check_hotplug_memory_range()).  If rejected, the memory
> hot-adding will be aborted with a proper kernel log message.  Which
> looks like something as below,
> 
>    virt/tdx: Reject hot-adding memory range: 0xXXXXXXXX-0xXXXXXXXX for TDX compatibility.
 > > The target use case is to support CXL memory on TDX enabled systems.
> If the CXL memory isn't compatible with TDX, the whole CXL memory
> range hot-adding will be rejected.  While the CXL memory can still be
> used via devdax interface.

I'm curious, why can that memory be used through devdax but not through 
the buddy? I'm probably missing something important :)

> 
> This also makes the original TDX memory hotplug notifier useless, so
> delete it.

The online-notifier would even be too late when used with the 
memmap-on-memory feature I assume, as we might be touching that memory 
even before being able to call memory online notifiers.

One way to handle that would be to switch to the MEM_PREPARE_ONLINE 
notifier, but it's still called per-memory block.

Nothing jumped at me, so

Acked-by: David Hildenbrand <david@redhat.com>
Huang, Ying Sept. 30, 2024, 11:51 p.m. UTC | #2
Hi, David,

Thanks a lot for comments!

David Hildenbrand <david@redhat.com> writes:

> On 30.09.24 07:51, Huang Ying wrote:
>> On systems with TDX (Trust Domain eXtensions) enabled, memory ranges
>> hot-added must be checked for compatibility by TDX.  This is currently
>> implemented through memory hotplug notifiers for each memory_block.
>> If a memory range which isn't TDX compatible is hot-added, for
>> example, some CXL memory, the command line as follows,
>>    $ echo 1 > /sys/devices/system/node/nodeX/memoryY/online
>> will report something like,
>>    bash: echo: write error: Operation not permitted
>> If pr_debug() is enabled, the error message like below will be shown
>> in the kernel log,
>>    online_pages [mem 0xXXXXXXXXXX-0xXXXXXXXXXX] failed
>> Both are too general to root cause the problem.  This will confuse
>> users.  One solution is to print some error messages in the TDX memory
>> hotplug notifier.  However, memory hotplug notifiers are called for
>> each memory block, so this may lead to a large volume of messages in
>> the kernel log if a large number of memory blocks are onlined with a
>> script or automatically.  For example, the typical size of memory
>> block is 128MB on x86_64, when online 64GB CXL memory, 512 messages
>> will be logged.
>
> ratelimiting would likely help here a lot, but I agree that it is
> suboptimal.
>
>> Therefore, in this patch, the whole hot-adding memory range is
>> checked
>> for TDX compatibility through a newly added architecture specific
>> function (arch_check_hotplug_memory_range()).  If rejected, the memory
>> hot-adding will be aborted with a proper kernel log message.  Which
>> looks like something as below,
>>    virt/tdx: Reject hot-adding memory range: 0xXXXXXXXX-0xXXXXXXXX
>> for TDX compatibility.
>> > The target use case is to support CXL memory on TDX enabled systems.
>> If the CXL memory isn't compatible with TDX, the whole CXL memory
>> range hot-adding will be rejected.  While the CXL memory can still be
>> used via devdax interface.
>
> I'm curious, why can that memory be used through devdax but not
> through the buddy? I'm probably missing something important :)

Because only TDX compatible memory can be used for TDX guest.  The buddy
is used to allocate memory for TDX guest.  While devdax will not be used
for that.

>> This also makes the original TDX memory hotplug notifier useless, so
>> delete it.
>
> The online-notifier would even be too late when used with the
> memmap-on-memory feature I assume, as we might be touching that memory
> even before being able to call memory online notifiers.

This should be OK.  Because we will not use the memory for TDX guest in
this way.

> One way to handle that would be to switch to the MEM_PREPARE_ONLINE
> notifier, but it's still called per-memory block.
>
> Nothing jumped at me, so
>
> Acked-by: David Hildenbrand <david@redhat.com>

Thank you very much!

--
Best Regards,
Huang, Ying
Dan Williams Oct. 1, 2024, 6:45 a.m. UTC | #3
David Hildenbrand wrote:
> On 30.09.24 07:51, Huang Ying wrote:
> > On systems with TDX (Trust Domain eXtensions) enabled, memory ranges
> > hot-added must be checked for compatibility by TDX.  This is currently
> > implemented through memory hotplug notifiers for each memory_block.
> > If a memory range which isn't TDX compatible is hot-added, for
> > example, some CXL memory, the command line as follows,
> > 
> >    $ echo 1 > /sys/devices/system/node/nodeX/memoryY/online
> > 
> > will report something like,
> > 
> >    bash: echo: write error: Operation not permitted
> > 
> > If pr_debug() is enabled, the error message like below will be shown
> > in the kernel log,
> > 
> >    online_pages [mem 0xXXXXXXXXXX-0xXXXXXXXXXX] failed
> > 
> > Both are too general to root cause the problem.  This will confuse
> > users.  One solution is to print some error messages in the TDX memory
> > hotplug notifier.  However, memory hotplug notifiers are called for
> > each memory block, so this may lead to a large volume of messages in
> > the kernel log if a large number of memory blocks are onlined with a
> > script or automatically.  For example, the typical size of memory
> > block is 128MB on x86_64, when online 64GB CXL memory, 512 messages
> > will be logged.
> 
> ratelimiting would likely help here a lot, but I agree that it is 
> suboptimal.
> 
> > 
> > Therefore, in this patch, the whole hot-adding memory range is checked
> > for TDX compatibility through a newly added architecture specific
> > function (arch_check_hotplug_memory_range()).  If rejected, the memory
> > hot-adding will be aborted with a proper kernel log message.  Which
> > looks like something as below,
> > 
> >    virt/tdx: Reject hot-adding memory range: 0xXXXXXXXX-0xXXXXXXXX for TDX compatibility.
>  > > The target use case is to support CXL memory on TDX enabled systems.
> > If the CXL memory isn't compatible with TDX, the whole CXL memory
> > range hot-adding will be rejected.  While the CXL memory can still be
> > used via devdax interface.
> 
> I'm curious, why can that memory be used through devdax but not through 
> the buddy? I'm probably missing something important :)

TDX requires memory that supports integrity and encryption. Until
platforms and expanders with a technology called CXL TSP arrives, CXL
memory is not able to join the TCB.

The TDX code for simplicity assumes that only memory present at boot
might be capable of TDX and that everything else is not.

Confidential VMs use guest_mem_fd to allocate memory, and that only
pulls from the page allocator as a backend.

This ability to use devdax in an offline mode is a hack to not
completely strand memory, but the practical expectation is that one does
not deploy CXL on a platform that will use TDX at least until this CXL
TSP capability arrives with future generation hardware.
David Hildenbrand Oct. 1, 2024, 7:54 a.m. UTC | #4
> Because only TDX compatible memory can be used for TDX guest.  The buddy
> is used to allocate memory for TDX guest.  While devdax will not be used
> for that.
> 
>>> This also makes the original TDX memory hotplug notifier useless, so
>>> delete it.
>>
>> The online-notifier would even be too late when used with the
>> memmap-on-memory feature I assume, as we might be touching that memory
>> even before being able to call memory online notifiers.
> 
> This should be OK.  Because we will not use the memory for TDX guest in
> this way.

Thanks for the reminder, I keep assuming that we are hotplugging memory 
into the guest, not the hypervisor.

Having that as a comment in tdx_check_hotplug_memory_range() would be 
helpful: we don't allow mixture of TDX and !TDX memory in the buddy so 
we won't run into trouble when launching encrypted VMs that really need 
TDX-capable memory.
David Hildenbrand Oct. 1, 2024, 7:56 a.m. UTC | #5
On 01.10.24 08:45, Dan Williams wrote:
> David Hildenbrand wrote:
>> On 30.09.24 07:51, Huang Ying wrote:
>>> On systems with TDX (Trust Domain eXtensions) enabled, memory ranges
>>> hot-added must be checked for compatibility by TDX.  This is currently
>>> implemented through memory hotplug notifiers for each memory_block.
>>> If a memory range which isn't TDX compatible is hot-added, for
>>> example, some CXL memory, the command line as follows,
>>>
>>>     $ echo 1 > /sys/devices/system/node/nodeX/memoryY/online
>>>
>>> will report something like,
>>>
>>>     bash: echo: write error: Operation not permitted
>>>
>>> If pr_debug() is enabled, the error message like below will be shown
>>> in the kernel log,
>>>
>>>     online_pages [mem 0xXXXXXXXXXX-0xXXXXXXXXXX] failed
>>>
>>> Both are too general to root cause the problem.  This will confuse
>>> users.  One solution is to print some error messages in the TDX memory
>>> hotplug notifier.  However, memory hotplug notifiers are called for
>>> each memory block, so this may lead to a large volume of messages in
>>> the kernel log if a large number of memory blocks are onlined with a
>>> script or automatically.  For example, the typical size of memory
>>> block is 128MB on x86_64, when online 64GB CXL memory, 512 messages
>>> will be logged.
>>
>> ratelimiting would likely help here a lot, but I agree that it is
>> suboptimal.
>>
>>>
>>> Therefore, in this patch, the whole hot-adding memory range is checked
>>> for TDX compatibility through a newly added architecture specific
>>> function (arch_check_hotplug_memory_range()).  If rejected, the memory
>>> hot-adding will be aborted with a proper kernel log message.  Which
>>> looks like something as below,
>>>
>>>     virt/tdx: Reject hot-adding memory range: 0xXXXXXXXX-0xXXXXXXXX for TDX compatibility.
>>   > > The target use case is to support CXL memory on TDX enabled systems.
>>> If the CXL memory isn't compatible with TDX, the whole CXL memory
>>> range hot-adding will be rejected.  While the CXL memory can still be
>>> used via devdax interface.
>>
>> I'm curious, why can that memory be used through devdax but not through
>> the buddy? I'm probably missing something important :)
> 
> TDX requires memory that supports integrity and encryption. Until
> platforms and expanders with a technology called CXL TSP arrives, CXL
> memory is not able to join the TCB.
> 
> The TDX code for simplicity assumes that only memory present at boot
> might be capable of TDX and that everything else is not.

So is there ever a chance where add_memory() would actually work now 
with TDX? Or can we just simplify and unconditionally reject 
add_memory() if TDX is enabled?

> 
> Confidential VMs use guest_mem_fd to allocate memory, and that only
> pulls from the page allocator as a backend.
> 
> This ability to use devdax in an offline mode is a hack to not

Thanks, I was missing the "hack" of it, and somehow (once again) assumed 
that we would be hotplugging memory into confidential VMs.
Dan Williams Oct. 1, 2024, 8:08 a.m. UTC | #6
David Hildenbrand wrote:
> On 01.10.24 08:45, Dan Williams wrote:
> > David Hildenbrand wrote:
> >> On 30.09.24 07:51, Huang Ying wrote:
> >>> On systems with TDX (Trust Domain eXtensions) enabled, memory ranges
> >>> hot-added must be checked for compatibility by TDX.  This is currently
> >>> implemented through memory hotplug notifiers for each memory_block.
> >>> If a memory range which isn't TDX compatible is hot-added, for
> >>> example, some CXL memory, the command line as follows,
> >>>
> >>>     $ echo 1 > /sys/devices/system/node/nodeX/memoryY/online
> >>>
> >>> will report something like,
> >>>
> >>>     bash: echo: write error: Operation not permitted
> >>>
> >>> If pr_debug() is enabled, the error message like below will be shown
> >>> in the kernel log,
> >>>
> >>>     online_pages [mem 0xXXXXXXXXXX-0xXXXXXXXXXX] failed
> >>>
> >>> Both are too general to root cause the problem.  This will confuse
> >>> users.  One solution is to print some error messages in the TDX memory
> >>> hotplug notifier.  However, memory hotplug notifiers are called for
> >>> each memory block, so this may lead to a large volume of messages in
> >>> the kernel log if a large number of memory blocks are onlined with a
> >>> script or automatically.  For example, the typical size of memory
> >>> block is 128MB on x86_64, when online 64GB CXL memory, 512 messages
> >>> will be logged.
> >>
> >> ratelimiting would likely help here a lot, but I agree that it is
> >> suboptimal.
> >>
> >>>
> >>> Therefore, in this patch, the whole hot-adding memory range is checked
> >>> for TDX compatibility through a newly added architecture specific
> >>> function (arch_check_hotplug_memory_range()).  If rejected, the memory
> >>> hot-adding will be aborted with a proper kernel log message.  Which
> >>> looks like something as below,
> >>>
> >>>     virt/tdx: Reject hot-adding memory range: 0xXXXXXXXX-0xXXXXXXXX for TDX compatibility.
> >>   > > The target use case is to support CXL memory on TDX enabled systems.
> >>> If the CXL memory isn't compatible with TDX, the whole CXL memory
> >>> range hot-adding will be rejected.  While the CXL memory can still be
> >>> used via devdax interface.
> >>
> >> I'm curious, why can that memory be used through devdax but not through
> >> the buddy? I'm probably missing something important :)
> > 
> > TDX requires memory that supports integrity and encryption. Until
> > platforms and expanders with a technology called CXL TSP arrives, CXL
> > memory is not able to join the TCB.
> > 
> > The TDX code for simplicity assumes that only memory present at boot
> > might be capable of TDX and that everything else is not.
> 
> So is there ever a chance where add_memory() would actually work now 
> with TDX? Or can we just simplify and unconditionally reject 
> add_memory() if TDX is enabled?

Only if the memory address range is enumerated by the platform firmware
(mcheck) at boot time.

This will eventually be possible with the CXL dynamic-capacity (DCD)
capability once CXL TSP arrives. In that scenario the CXL DCD expander
is brought into the TCB at boot time and assigned a fixed address range
where future memory could arrive. I.e. the CXL device is brought into
the TCB at boot, but the memory it provides can arrive later.

> > Confidential VMs use guest_mem_fd to allocate memory, and that only
> > pulls from the page allocator as a backend.
> > 
> > This ability to use devdax in an offline mode is a hack to not
> 
> Thanks, I was missing the "hack" of it, and somehow (once again) assumed 
> that we would be hotplugging memory into confidential VMs.

When / if dynamic capacity and this security-protocol for CXL arrives
that may yet happen. For now it is safe to block adding anything which
mcheck does not like which is everything but memory present at boot
(is_tdx_memory()).
Yang Shi Oct. 3, 2024, 8:39 p.m. UTC | #7
On Mon, Sep 30, 2024 at 4:54 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Hi, David,
>
> Thanks a lot for comments!
>
> David Hildenbrand <david@redhat.com> writes:
>
> > On 30.09.24 07:51, Huang Ying wrote:
> >> On systems with TDX (Trust Domain eXtensions) enabled, memory ranges
> >> hot-added must be checked for compatibility by TDX.  This is currently
> >> implemented through memory hotplug notifiers for each memory_block.
> >> If a memory range which isn't TDX compatible is hot-added, for
> >> example, some CXL memory, the command line as follows,
> >>    $ echo 1 > /sys/devices/system/node/nodeX/memoryY/online
> >> will report something like,
> >>    bash: echo: write error: Operation not permitted
> >> If pr_debug() is enabled, the error message like below will be shown
> >> in the kernel log,
> >>    online_pages [mem 0xXXXXXXXXXX-0xXXXXXXXXXX] failed
> >> Both are too general to root cause the problem.  This will confuse
> >> users.  One solution is to print some error messages in the TDX memory
> >> hotplug notifier.  However, memory hotplug notifiers are called for
> >> each memory block, so this may lead to a large volume of messages in
> >> the kernel log if a large number of memory blocks are onlined with a
> >> script or automatically.  For example, the typical size of memory
> >> block is 128MB on x86_64, when online 64GB CXL memory, 512 messages
> >> will be logged.
> >
> > ratelimiting would likely help here a lot, but I agree that it is
> > suboptimal.
> >
> >> Therefore, in this patch, the whole hot-adding memory range is
> >> checked
> >> for TDX compatibility through a newly added architecture specific
> >> function (arch_check_hotplug_memory_range()).  If rejected, the memory
> >> hot-adding will be aborted with a proper kernel log message.  Which
> >> looks like something as below,
> >>    virt/tdx: Reject hot-adding memory range: 0xXXXXXXXX-0xXXXXXXXX
> >> for TDX compatibility.
> >> > The target use case is to support CXL memory on TDX enabled systems.
> >> If the CXL memory isn't compatible with TDX, the whole CXL memory
> >> range hot-adding will be rejected.  While the CXL memory can still be
> >> used via devdax interface.
> >
> > I'm curious, why can that memory be used through devdax but not
> > through the buddy? I'm probably missing something important :)
>
> Because only TDX compatible memory can be used for TDX guest.  The buddy
> is used to allocate memory for TDX guest.  While devdax will not be used
> for that.

Sorry for chiming in late. I think CXL also faces the similar problem
on the platform with MTE (memory tagging extension on ARM64). AFAIK,
we can't have MTE on CXL, so CXL has to stay as dax device if MTE is
enabled.

We should need a similar mechanism to prevent users from hot-adding
CXL memory if MTE is on. But not like TDX I don't think we have a
simple way to tell whether the pfn belongs to CXL or not. Please
correct me if I'm wrong. I'm wondering whether we can find a more
common way to tell memory hotplug to not hot-add some region. For
example, a special flag in struct resource. off the top of my head.

No solid idea yet, I'm definitely seeking some advice.

>
> >> This also makes the original TDX memory hotplug notifier useless, so
> >> delete it.
> >
> > The online-notifier would even be too late when used with the
> > memmap-on-memory feature I assume, as we might be touching that memory
> > even before being able to call memory online notifiers.
>
> This should be OK.  Because we will not use the memory for TDX guest in
> this way.
>
> > One way to handle that would be to switch to the MEM_PREPARE_ONLINE
> > notifier, but it's still called per-memory block.
> >
> > Nothing jumped at me, so
> >
> > Acked-by: David Hildenbrand <david@redhat.com>
>
> Thank you very much!
>
> --
> Best Regards,
> Huang, Ying
>
Dan Williams Oct. 3, 2024, 11:32 p.m. UTC | #8
Yang Shi wrote:
> On Mon, Sep 30, 2024 at 4:54 PM Huang, Ying <ying.huang@intel.com> wrote:
> >
> > Hi, David,
> >
> > Thanks a lot for comments!
> >
> > David Hildenbrand <david@redhat.com> writes:
> >
> > > On 30.09.24 07:51, Huang Ying wrote:
> > >> On systems with TDX (Trust Domain eXtensions) enabled, memory ranges
> > >> hot-added must be checked for compatibility by TDX.  This is currently
> > >> implemented through memory hotplug notifiers for each memory_block.
> > >> If a memory range which isn't TDX compatible is hot-added, for
> > >> example, some CXL memory, the command line as follows,
> > >>    $ echo 1 > /sys/devices/system/node/nodeX/memoryY/online
> > >> will report something like,
> > >>    bash: echo: write error: Operation not permitted
> > >> If pr_debug() is enabled, the error message like below will be shown
> > >> in the kernel log,
> > >>    online_pages [mem 0xXXXXXXXXXX-0xXXXXXXXXXX] failed
> > >> Both are too general to root cause the problem.  This will confuse
> > >> users.  One solution is to print some error messages in the TDX memory
> > >> hotplug notifier.  However, memory hotplug notifiers are called for
> > >> each memory block, so this may lead to a large volume of messages in
> > >> the kernel log if a large number of memory blocks are onlined with a
> > >> script or automatically.  For example, the typical size of memory
> > >> block is 128MB on x86_64, when online 64GB CXL memory, 512 messages
> > >> will be logged.
> > >
> > > ratelimiting would likely help here a lot, but I agree that it is
> > > suboptimal.
> > >
> > >> Therefore, in this patch, the whole hot-adding memory range is
> > >> checked
> > >> for TDX compatibility through a newly added architecture specific
> > >> function (arch_check_hotplug_memory_range()).  If rejected, the memory
> > >> hot-adding will be aborted with a proper kernel log message.  Which
> > >> looks like something as below,
> > >>    virt/tdx: Reject hot-adding memory range: 0xXXXXXXXX-0xXXXXXXXX
> > >> for TDX compatibility.
> > >> > The target use case is to support CXL memory on TDX enabled systems.
> > >> If the CXL memory isn't compatible with TDX, the whole CXL memory
> > >> range hot-adding will be rejected.  While the CXL memory can still be
> > >> used via devdax interface.
> > >
> > > I'm curious, why can that memory be used through devdax but not
> > > through the buddy? I'm probably missing something important :)
> >
> > Because only TDX compatible memory can be used for TDX guest.  The buddy
> > is used to allocate memory for TDX guest.  While devdax will not be used
> > for that.
> 
> Sorry for chiming in late. I think CXL also faces the similar problem
> on the platform with MTE (memory tagging extension on ARM64). AFAIK,
> we can't have MTE on CXL, so CXL has to stay as dax device if MTE is
> enabled.
> 
> We should need a similar mechanism to prevent users from hot-adding
> CXL memory if MTE is on. But not like TDX I don't think we have a
> simple way to tell whether the pfn belongs to CXL or not. Please
> correct me if I'm wrong. I'm wondering whether we can find a more
> common way to tell memory hotplug to not hot-add some region. For
> example, a special flag in struct resource. off the top of my head.
> 
> No solid idea yet, I'm definitely seeking some advice.

Could the ARM version of arch_check_hotplug_memory_range() check if MTE
is enabled in the CPU and then ask the CXL subsystem if the address range is
backed by a topology that supports MTE?

However, why would it be ok to access CXL memory without MTE via devdax,
but not as online page allocator memory?

If the goal is to simply deny any and all non-MTE supported CXL region
from attaching then that could probably be handled as a modification to
the "cxl_acpi" driver to deny region creation unless it supports
everything the CPU expects from "memory".
Yang Shi Oct. 4, 2024, 1:10 a.m. UTC | #9
On Thu, Oct 3, 2024 at 4:32 PM Dan Williams <dan.j.williams@intel.com> wrote:
>
> Yang Shi wrote:
> > On Mon, Sep 30, 2024 at 4:54 PM Huang, Ying <ying.huang@intel.com> wrote:
> > >
> > > Hi, David,
> > >
> > > Thanks a lot for comments!
> > >
> > > David Hildenbrand <david@redhat.com> writes:
> > >
> > > > On 30.09.24 07:51, Huang Ying wrote:
> > > >> On systems with TDX (Trust Domain eXtensions) enabled, memory ranges
> > > >> hot-added must be checked for compatibility by TDX.  This is currently
> > > >> implemented through memory hotplug notifiers for each memory_block.
> > > >> If a memory range which isn't TDX compatible is hot-added, for
> > > >> example, some CXL memory, the command line as follows,
> > > >>    $ echo 1 > /sys/devices/system/node/nodeX/memoryY/online
> > > >> will report something like,
> > > >>    bash: echo: write error: Operation not permitted
> > > >> If pr_debug() is enabled, the error message like below will be shown
> > > >> in the kernel log,
> > > >>    online_pages [mem 0xXXXXXXXXXX-0xXXXXXXXXXX] failed
> > > >> Both are too general to root cause the problem.  This will confuse
> > > >> users.  One solution is to print some error messages in the TDX memory
> > > >> hotplug notifier.  However, memory hotplug notifiers are called for
> > > >> each memory block, so this may lead to a large volume of messages in
> > > >> the kernel log if a large number of memory blocks are onlined with a
> > > >> script or automatically.  For example, the typical size of memory
> > > >> block is 128MB on x86_64, when online 64GB CXL memory, 512 messages
> > > >> will be logged.
> > > >
> > > > ratelimiting would likely help here a lot, but I agree that it is
> > > > suboptimal.
> > > >
> > > >> Therefore, in this patch, the whole hot-adding memory range is
> > > >> checked
> > > >> for TDX compatibility through a newly added architecture specific
> > > >> function (arch_check_hotplug_memory_range()).  If rejected, the memory
> > > >> hot-adding will be aborted with a proper kernel log message.  Which
> > > >> looks like something as below,
> > > >>    virt/tdx: Reject hot-adding memory range: 0xXXXXXXXX-0xXXXXXXXX
> > > >> for TDX compatibility.
> > > >> > The target use case is to support CXL memory on TDX enabled systems.
> > > >> If the CXL memory isn't compatible with TDX, the whole CXL memory
> > > >> range hot-adding will be rejected.  While the CXL memory can still be
> > > >> used via devdax interface.
> > > >
> > > > I'm curious, why can that memory be used through devdax but not
> > > > through the buddy? I'm probably missing something important :)
> > >
> > > Because only TDX compatible memory can be used for TDX guest.  The buddy
> > > is used to allocate memory for TDX guest.  While devdax will not be used
> > > for that.
> >
> > Sorry for chiming in late. I think CXL also faces the similar problem
> > on the platform with MTE (memory tagging extension on ARM64). AFAIK,
> > we can't have MTE on CXL, so CXL has to stay as dax device if MTE is
> > enabled.
> >
> > We should need a similar mechanism to prevent users from hot-adding
> > CXL memory if MTE is on. But not like TDX I don't think we have a
> > simple way to tell whether the pfn belongs to CXL or not. Please
> > correct me if I'm wrong. I'm wondering whether we can find a more
> > common way to tell memory hotplug to not hot-add some region. For
> > example, a special flag in struct resource. off the top of my head.
> >
> > No solid idea yet, I'm definitely seeking some advice.
>
> Could the ARM version of arch_check_hotplug_memory_range() check if MTE
> is enabled in the CPU and then ask the CXL subsystem if the address range is
> backed by a topology that supports MTE?

Kernel can tell whether MTE is really enabled. For the CXL part, IIUC
that relies on the CXL subsystem is able to tell whether that range
can support MTE or not, right? Or CXL subsystem tells us whether the
range is CXL memory range or not, then we can just refuse MTE for all
CXL regions for now. Does CXL support this now?

>
> However, why would it be ok to access CXL memory without MTE via devdax,
> but not as online page allocator memory?

CXL memory can be onlined as system ram as long as MTE is not enabled.
It just can be used as devdax device if MTE is enabled.

>
> If the goal is to simply deny any and all non-MTE supported CXL region
> from attaching then that could probably be handled as a modification to
> the "cxl_acpi" driver to deny region creation unless it supports
> everything the CPU expects from "memory".

I'm not quite familiar with the details in CXL driver. What did you
mean "deny region creation"? As long as the CXL memory still can be
used as devdax device, it should be fine.
Dan Williams Oct. 4, 2024, 3:15 a.m. UTC | #10
Yang Shi wrote:
> On Thu, Oct 3, 2024 at 4:32 PM Dan Williams <dan.j.williams@intel.com> wrote:
> >
> > Yang Shi wrote:
> > > On Mon, Sep 30, 2024 at 4:54 PM Huang, Ying <ying.huang@intel.com> wrote:
> > > >
> > > > Hi, David,
> > > >
> > > > Thanks a lot for comments!
> > > >
> > > > David Hildenbrand <david@redhat.com> writes:
> > > >
> > > > > On 30.09.24 07:51, Huang Ying wrote:
> > > > >> On systems with TDX (Trust Domain eXtensions) enabled, memory ranges
> > > > >> hot-added must be checked for compatibility by TDX.  This is currently
> > > > >> implemented through memory hotplug notifiers for each memory_block.
> > > > >> If a memory range which isn't TDX compatible is hot-added, for
> > > > >> example, some CXL memory, the command line as follows,
> > > > >>    $ echo 1 > /sys/devices/system/node/nodeX/memoryY/online
> > > > >> will report something like,
> > > > >>    bash: echo: write error: Operation not permitted
> > > > >> If pr_debug() is enabled, the error message like below will be shown
> > > > >> in the kernel log,
> > > > >>    online_pages [mem 0xXXXXXXXXXX-0xXXXXXXXXXX] failed
> > > > >> Both are too general to root cause the problem.  This will confuse
> > > > >> users.  One solution is to print some error messages in the TDX memory
> > > > >> hotplug notifier.  However, memory hotplug notifiers are called for
> > > > >> each memory block, so this may lead to a large volume of messages in
> > > > >> the kernel log if a large number of memory blocks are onlined with a
> > > > >> script or automatically.  For example, the typical size of memory
> > > > >> block is 128MB on x86_64, when online 64GB CXL memory, 512 messages
> > > > >> will be logged.
> > > > >
> > > > > ratelimiting would likely help here a lot, but I agree that it is
> > > > > suboptimal.
> > > > >
> > > > >> Therefore, in this patch, the whole hot-adding memory range is
> > > > >> checked
> > > > >> for TDX compatibility through a newly added architecture specific
> > > > >> function (arch_check_hotplug_memory_range()).  If rejected, the memory
> > > > >> hot-adding will be aborted with a proper kernel log message.  Which
> > > > >> looks like something as below,
> > > > >>    virt/tdx: Reject hot-adding memory range: 0xXXXXXXXX-0xXXXXXXXX
> > > > >> for TDX compatibility.
> > > > >> > The target use case is to support CXL memory on TDX enabled systems.
> > > > >> If the CXL memory isn't compatible with TDX, the whole CXL memory
> > > > >> range hot-adding will be rejected.  While the CXL memory can still be
> > > > >> used via devdax interface.
> > > > >
> > > > > I'm curious, why can that memory be used through devdax but not
> > > > > through the buddy? I'm probably missing something important :)
> > > >
> > > > Because only TDX compatible memory can be used for TDX guest.  The buddy
> > > > is used to allocate memory for TDX guest.  While devdax will not be used
> > > > for that.
> > >
> > > Sorry for chiming in late. I think CXL also faces the similar problem
> > > on the platform with MTE (memory tagging extension on ARM64). AFAIK,
> > > we can't have MTE on CXL, so CXL has to stay as dax device if MTE is
> > > enabled.
> > >
> > > We should need a similar mechanism to prevent users from hot-adding
> > > CXL memory if MTE is on. But not like TDX I don't think we have a
> > > simple way to tell whether the pfn belongs to CXL or not. Please
> > > correct me if I'm wrong. I'm wondering whether we can find a more
> > > common way to tell memory hotplug to not hot-add some region. For
> > > example, a special flag in struct resource. off the top of my head.
> > >
> > > No solid idea yet, I'm definitely seeking some advice.
> >
> > Could the ARM version of arch_check_hotplug_memory_range() check if MTE
> > is enabled in the CPU and then ask the CXL subsystem if the address range is
> > backed by a topology that supports MTE?
> 
> Kernel can tell whether MTE is really enabled. For the CXL part, IIUC
> that relies on the CXL subsystem is able to tell whether that range
> can support MTE or not, right? Or CXL subsystem tells us whether the
> range is CXL memory range or not, then we can just refuse MTE for all
> CXL regions for now. Does CXL support this now?

So the CXL specification has section:

    8.2.4.31 CXL Extended Metadata Capability Register

...that indicates if the device supports "Extended Metadata" (EMD).
However, the CXL specification does not talk about how a given hosts
uses the extended metadata capabilities of a device. That detail would
need to come from an ARM platform specification.

Currently CXL subsystem does nothing with this since there has been no
need to date, but I would expect someone from the ARM side to plumb this
detection into the CXL subsystem.

> > However, why would it be ok to access CXL memory without MTE via devdax,
> > but not as online page allocator memory?
> 
> CXL memory can be onlined as system ram as long as MTE is not enabled.
> It just can be used as devdax device if MTE is enabled.

Do you mean the kernel only manages MTE for kernel pages, but with user
mapped memory the application will need to implicitly know that
memory-tagging is not available?

I worry about applications that might not know that their heap is coming
from a userspace memory allocator backed by device-dax rather than the
kernel.

> > If the goal is to simply deny any and all non-MTE supported CXL region
> > from attaching then that could probably be handled as a modification to
> > the "cxl_acpi" driver to deny region creation unless it supports
> > everything the CPU expects from "memory".
> 
> I'm not quite familiar with the details in CXL driver. What did you
> mean "deny region creation"? As long as the CXL memory still can be
> used as devdax device, it should be fine.

Meaning that the CXL subsytem knows how to, for a given address range, figure
out the members and geometry of the CXL devices that contribute to that
range (CXL region). It would be straightforward to add EMD to that
enumeration and flag the CXL region as not online-capable if the CPU has
MTE enabled but no EMD capability.
David Hildenbrand Oct. 4, 2024, 10:16 a.m. UTC | #11
On 01.10.24 10:08, Dan Williams wrote:
> David Hildenbrand wrote:
>> On 01.10.24 08:45, Dan Williams wrote:
>>> David Hildenbrand wrote:
>>>> On 30.09.24 07:51, Huang Ying wrote:
>>>>> On systems with TDX (Trust Domain eXtensions) enabled, memory ranges
>>>>> hot-added must be checked for compatibility by TDX.  This is currently
>>>>> implemented through memory hotplug notifiers for each memory_block.
>>>>> If a memory range which isn't TDX compatible is hot-added, for
>>>>> example, some CXL memory, the command line as follows,
>>>>>
>>>>>      $ echo 1 > /sys/devices/system/node/nodeX/memoryY/online
>>>>>
>>>>> will report something like,
>>>>>
>>>>>      bash: echo: write error: Operation not permitted
>>>>>
>>>>> If pr_debug() is enabled, the error message like below will be shown
>>>>> in the kernel log,
>>>>>
>>>>>      online_pages [mem 0xXXXXXXXXXX-0xXXXXXXXXXX] failed
>>>>>
>>>>> Both are too general to root cause the problem.  This will confuse
>>>>> users.  One solution is to print some error messages in the TDX memory
>>>>> hotplug notifier.  However, memory hotplug notifiers are called for
>>>>> each memory block, so this may lead to a large volume of messages in
>>>>> the kernel log if a large number of memory blocks are onlined with a
>>>>> script or automatically.  For example, the typical size of memory
>>>>> block is 128MB on x86_64, when online 64GB CXL memory, 512 messages
>>>>> will be logged.
>>>>
>>>> ratelimiting would likely help here a lot, but I agree that it is
>>>> suboptimal.
>>>>
>>>>>
>>>>> Therefore, in this patch, the whole hot-adding memory range is checked
>>>>> for TDX compatibility through a newly added architecture specific
>>>>> function (arch_check_hotplug_memory_range()).  If rejected, the memory
>>>>> hot-adding will be aborted with a proper kernel log message.  Which
>>>>> looks like something as below,
>>>>>
>>>>>      virt/tdx: Reject hot-adding memory range: 0xXXXXXXXX-0xXXXXXXXX for TDX compatibility.
>>>>    > > The target use case is to support CXL memory on TDX enabled systems.
>>>>> If the CXL memory isn't compatible with TDX, the whole CXL memory
>>>>> range hot-adding will be rejected.  While the CXL memory can still be
>>>>> used via devdax interface.
>>>>
>>>> I'm curious, why can that memory be used through devdax but not through
>>>> the buddy? I'm probably missing something important :)
>>>
>>> TDX requires memory that supports integrity and encryption. Until
>>> platforms and expanders with a technology called CXL TSP arrives, CXL
>>> memory is not able to join the TCB.
>>>
>>> The TDX code for simplicity assumes that only memory present at boot
>>> might be capable of TDX and that everything else is not.
>>
>> So is there ever a chance where add_memory() would actually work now
>> with TDX? Or can we just simplify and unconditionally reject
>> add_memory() if TDX is enabled?
> 
> Only if the memory address range is enumerated by the platform firmware
> (mcheck) at boot time.
> 
> This will eventually be possible with the CXL dynamic-capacity (DCD)
> capability once CXL TSP arrives. In that scenario the CXL DCD expander
> is brought into the TCB at boot time and assigned a fixed address range
> where future memory could arrive. I.e. the CXL device is brought into
> the TCB at boot, but the memory it provides can arrive later.
> 
>>> Confidential VMs use guest_mem_fd to allocate memory, and that only
>>> pulls from the page allocator as a backend.
>>>
>>> This ability to use devdax in an offline mode is a hack to not
>>
>> Thanks, I was missing the "hack" of it, and somehow (once again) assumed
>> that we would be hotplugging memory into confidential VMs.
> 
> When / if dynamic capacity and this security-protocol for CXL arrives
> that may yet happen. For now it is safe to block adding anything which
> mcheck does not like which is everything but memory present at boot
> (is_tdx_memory()).

Makes sense, thanks!
David Hildenbrand Oct. 4, 2024, 10:21 a.m. UTC | #12
On 04.10.24 05:15, Dan Williams wrote:
> Yang Shi wrote:
>> On Thu, Oct 3, 2024 at 4:32 PM Dan Williams <dan.j.williams@intel.com> wrote:
>>>
>>> Yang Shi wrote:
>>>> On Mon, Sep 30, 2024 at 4:54 PM Huang, Ying <ying.huang@intel.com> wrote:
>>>>>
>>>>> Hi, David,
>>>>>
>>>>> Thanks a lot for comments!
>>>>>
>>>>> David Hildenbrand <david@redhat.com> writes:
>>>>>
>>>>>> On 30.09.24 07:51, Huang Ying wrote:
>>>>>>> On systems with TDX (Trust Domain eXtensions) enabled, memory ranges
>>>>>>> hot-added must be checked for compatibility by TDX.  This is currently
>>>>>>> implemented through memory hotplug notifiers for each memory_block.
>>>>>>> If a memory range which isn't TDX compatible is hot-added, for
>>>>>>> example, some CXL memory, the command line as follows,
>>>>>>>     $ echo 1 > /sys/devices/system/node/nodeX/memoryY/online
>>>>>>> will report something like,
>>>>>>>     bash: echo: write error: Operation not permitted
>>>>>>> If pr_debug() is enabled, the error message like below will be shown
>>>>>>> in the kernel log,
>>>>>>>     online_pages [mem 0xXXXXXXXXXX-0xXXXXXXXXXX] failed
>>>>>>> Both are too general to root cause the problem.  This will confuse
>>>>>>> users.  One solution is to print some error messages in the TDX memory
>>>>>>> hotplug notifier.  However, memory hotplug notifiers are called for
>>>>>>> each memory block, so this may lead to a large volume of messages in
>>>>>>> the kernel log if a large number of memory blocks are onlined with a
>>>>>>> script or automatically.  For example, the typical size of memory
>>>>>>> block is 128MB on x86_64, when online 64GB CXL memory, 512 messages
>>>>>>> will be logged.
>>>>>>
>>>>>> ratelimiting would likely help here a lot, but I agree that it is
>>>>>> suboptimal.
>>>>>>
>>>>>>> Therefore, in this patch, the whole hot-adding memory range is
>>>>>>> checked
>>>>>>> for TDX compatibility through a newly added architecture specific
>>>>>>> function (arch_check_hotplug_memory_range()).  If rejected, the memory
>>>>>>> hot-adding will be aborted with a proper kernel log message.  Which
>>>>>>> looks like something as below,
>>>>>>>     virt/tdx: Reject hot-adding memory range: 0xXXXXXXXX-0xXXXXXXXX
>>>>>>> for TDX compatibility.
>>>>>>>> The target use case is to support CXL memory on TDX enabled systems.
>>>>>>> If the CXL memory isn't compatible with TDX, the whole CXL memory
>>>>>>> range hot-adding will be rejected.  While the CXL memory can still be
>>>>>>> used via devdax interface.
>>>>>>
>>>>>> I'm curious, why can that memory be used through devdax but not
>>>>>> through the buddy? I'm probably missing something important :)
>>>>>
>>>>> Because only TDX compatible memory can be used for TDX guest.  The buddy
>>>>> is used to allocate memory for TDX guest.  While devdax will not be used
>>>>> for that.
>>>>
>>>> Sorry for chiming in late. I think CXL also faces the similar problem
>>>> on the platform with MTE (memory tagging extension on ARM64). AFAIK,
>>>> we can't have MTE on CXL, so CXL has to stay as dax device if MTE is
>>>> enabled.
>>>>
>>>> We should need a similar mechanism to prevent users from hot-adding
>>>> CXL memory if MTE is on. But not like TDX I don't think we have a
>>>> simple way to tell whether the pfn belongs to CXL or not. Please
>>>> correct me if I'm wrong. I'm wondering whether we can find a more
>>>> common way to tell memory hotplug to not hot-add some region. For
>>>> example, a special flag in struct resource. off the top of my head.
>>>>
>>>> No solid idea yet, I'm definitely seeking some advice.
>>>
>>> Could the ARM version of arch_check_hotplug_memory_range() check if MTE
>>> is enabled in the CPU and then ask the CXL subsystem if the address range is
>>> backed by a topology that supports MTE?
>>
>> Kernel can tell whether MTE is really enabled. For the CXL part, IIUC
>> that relies on the CXL subsystem is able to tell whether that range
>> can support MTE or not, right? Or CXL subsystem tells us whether the
>> range is CXL memory range or not, then we can just refuse MTE for all
>> CXL regions for now. Does CXL support this now?
> 
> So the CXL specification has section:
> 
>      8.2.4.31 CXL Extended Metadata Capability Register
> 
> ...that indicates if the device supports "Extended Metadata" (EMD).
> However, the CXL specification does not talk about how a given hosts
> uses the extended metadata capabilities of a device. That detail would
> need to come from an ARM platform specification.
> 
> Currently CXL subsystem does nothing with this since there has been no
> need to date, but I would expect someone from the ARM side to plumb this
> detection into the CXL subsystem.
> 
>>> However, why would it be ok to access CXL memory without MTE via devdax,
>>> but not as online page allocator memory?
>>
>> CXL memory can be onlined as system ram as long as MTE is not enabled.
>> It just can be used as devdax device if MTE is enabled.
> 
> Do you mean the kernel only manages MTE for kernel pages, but with user
> mapped memory the application will need to implicitly know that
> memory-tagging is not available?
> 
> I worry about applications that might not know that their heap is coming
> from a userspace memory allocator backed by device-dax rather than the
> kernel.

I recall that MTE is requested by user space via mprotect(). If we end 
up with memory that is not taggable, we would have to fail the 
operation, which is not desirable.

This is what we want to avoid, so if MTE is enabled, all memory in the 
buddy should be taggable.

> 
>>> If the goal is to simply deny any and all non-MTE supported CXL region
>>> from attaching then that could probably be handled as a modification to
>>> the "cxl_acpi" driver to deny region creation unless it supports
>>> everything the CPU expects from "memory".
>>
>> I'm not quite familiar with the details in CXL driver. What did you
>> mean "deny region creation"? As long as the CXL memory still can be
>> used as devdax device, it should be fine.
> 
> Meaning that the CXL subsytem knows how to, for a given address range, figure
> out the members and geometry of the CXL devices that contribute to that
> range (CXL region). It would be straightforward to add EMD to that
> enumeration and flag the CXL region as not online-capable if the CPU has
> MTE enabled but no EMD capability.

If it's really just CXL memory we are worrying about, we could pass a 
flag to add_memory_driver_managed(), and passing that to our callback here.

Not sure if that is the most reliable way of handling it :) What about 
other ways of hotplugging memory besides CXL? Are we sure, they are/will 
be providing taggable memory?
Yang Shi Oct. 4, 2024, 3:46 p.m. UTC | #13
On Thu, Oct 3, 2024 at 8:15 PM Dan Williams <dan.j.williams@intel.com> wrote:
>
> Yang Shi wrote:
> > On Thu, Oct 3, 2024 at 4:32 PM Dan Williams <dan.j.williams@intel.com> wrote:
> > >
> > > Yang Shi wrote:
> > > > On Mon, Sep 30, 2024 at 4:54 PM Huang, Ying <ying.huang@intel.com> wrote:
> > > > >
> > > > > Hi, David,
> > > > >
> > > > > Thanks a lot for comments!
> > > > >
> > > > > David Hildenbrand <david@redhat.com> writes:
> > > > >
> > > > > > On 30.09.24 07:51, Huang Ying wrote:
> > > > > >> On systems with TDX (Trust Domain eXtensions) enabled, memory ranges
> > > > > >> hot-added must be checked for compatibility by TDX.  This is currently
> > > > > >> implemented through memory hotplug notifiers for each memory_block.
> > > > > >> If a memory range which isn't TDX compatible is hot-added, for
> > > > > >> example, some CXL memory, the command line as follows,
> > > > > >>    $ echo 1 > /sys/devices/system/node/nodeX/memoryY/online
> > > > > >> will report something like,
> > > > > >>    bash: echo: write error: Operation not permitted
> > > > > >> If pr_debug() is enabled, the error message like below will be shown
> > > > > >> in the kernel log,
> > > > > >>    online_pages [mem 0xXXXXXXXXXX-0xXXXXXXXXXX] failed
> > > > > >> Both are too general to root cause the problem.  This will confuse
> > > > > >> users.  One solution is to print some error messages in the TDX memory
> > > > > >> hotplug notifier.  However, memory hotplug notifiers are called for
> > > > > >> each memory block, so this may lead to a large volume of messages in
> > > > > >> the kernel log if a large number of memory blocks are onlined with a
> > > > > >> script or automatically.  For example, the typical size of memory
> > > > > >> block is 128MB on x86_64, when online 64GB CXL memory, 512 messages
> > > > > >> will be logged.
> > > > > >
> > > > > > ratelimiting would likely help here a lot, but I agree that it is
> > > > > > suboptimal.
> > > > > >
> > > > > >> Therefore, in this patch, the whole hot-adding memory range is
> > > > > >> checked
> > > > > >> for TDX compatibility through a newly added architecture specific
> > > > > >> function (arch_check_hotplug_memory_range()).  If rejected, the memory
> > > > > >> hot-adding will be aborted with a proper kernel log message.  Which
> > > > > >> looks like something as below,
> > > > > >>    virt/tdx: Reject hot-adding memory range: 0xXXXXXXXX-0xXXXXXXXX
> > > > > >> for TDX compatibility.
> > > > > >> > The target use case is to support CXL memory on TDX enabled systems.
> > > > > >> If the CXL memory isn't compatible with TDX, the whole CXL memory
> > > > > >> range hot-adding will be rejected.  While the CXL memory can still be
> > > > > >> used via devdax interface.
> > > > > >
> > > > > > I'm curious, why can that memory be used through devdax but not
> > > > > > through the buddy? I'm probably missing something important :)
> > > > >
> > > > > Because only TDX compatible memory can be used for TDX guest.  The buddy
> > > > > is used to allocate memory for TDX guest.  While devdax will not be used
> > > > > for that.
> > > >
> > > > Sorry for chiming in late. I think CXL also faces the similar problem
> > > > on the platform with MTE (memory tagging extension on ARM64). AFAIK,
> > > > we can't have MTE on CXL, so CXL has to stay as dax device if MTE is
> > > > enabled.
> > > >
> > > > We should need a similar mechanism to prevent users from hot-adding
> > > > CXL memory if MTE is on. But not like TDX I don't think we have a
> > > > simple way to tell whether the pfn belongs to CXL or not. Please
> > > > correct me if I'm wrong. I'm wondering whether we can find a more
> > > > common way to tell memory hotplug to not hot-add some region. For
> > > > example, a special flag in struct resource. off the top of my head.
> > > >
> > > > No solid idea yet, I'm definitely seeking some advice.
> > >
> > > Could the ARM version of arch_check_hotplug_memory_range() check if MTE
> > > is enabled in the CPU and then ask the CXL subsystem if the address range is
> > > backed by a topology that supports MTE?
> >
> > Kernel can tell whether MTE is really enabled. For the CXL part, IIUC
> > that relies on the CXL subsystem is able to tell whether that range
> > can support MTE or not, right? Or CXL subsystem tells us whether the
> > range is CXL memory range or not, then we can just refuse MTE for all
> > CXL regions for now. Does CXL support this now?
>
> So the CXL specification has section:
>
>     8.2.4.31 CXL Extended Metadata Capability Register
>
> ...that indicates if the device supports "Extended Metadata" (EMD).
> However, the CXL specification does not talk about how a given hosts
> uses the extended metadata capabilities of a device. That detail would
> need to come from an ARM platform specification.
>
> Currently CXL subsystem does nothing with this since there has been no
> need to date, but I would expect someone from the ARM side to plumb this
> detection into the CXL subsystem.

Yeah, it should be a good way to let the kernel know whether CXL
supports memory tagging or not.

>
> > > However, why would it be ok to access CXL memory without MTE via devdax,
> > > but not as online page allocator memory?
> >
> > CXL memory can be onlined as system ram as long as MTE is not enabled.
> > It just can be used as devdax device if MTE is enabled.
>
> Do you mean the kernel only manages MTE for kernel pages, but with user
> mapped memory the application will need to implicitly know that
> memory-tagging is not available?

I think the current assumption is that all buddy memory (can be used
by userspace) should be taggable. And memory tagging is only supported
for anonymous mapping and tmpfs. I'm adding hugetlbfs support. But any
memory backed by the real backing store doesn't have memory tagging
support.

>
> I worry about applications that might not know that their heap is coming
> from a userspace memory allocator backed by device-dax rather than the
> kernel.

IIUC, memory mapping from device-dax is a file mapping, right? If so,
it is safe. If it is not, I think it is easy to handle. We can just
reject any VM_MTE mapping from DAX.

>
> > > If the goal is to simply deny any and all non-MTE supported CXL region
> > > from attaching then that could probably be handled as a modification to
> > > the "cxl_acpi" driver to deny region creation unless it supports
> > > everything the CPU expects from "memory".
> >
> > I'm not quite familiar with the details in CXL driver. What did you
> > mean "deny region creation"? As long as the CXL memory still can be
> > used as devdax device, it should be fine.
>
> Meaning that the CXL subsytem knows how to, for a given address range, figure
> out the members and geometry of the CXL devices that contribute to that
> range (CXL region). It would be straightforward to add EMD to that
> enumeration and flag the CXL region as not online-capable if the CPU has
> MTE enabled but no EMD capability.

It sounds like a good way to me.
Yang Shi Oct. 4, 2024, 3:54 p.m. UTC | #14
On Fri, Oct 4, 2024 at 3:21 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 04.10.24 05:15, Dan Williams wrote:
> > Yang Shi wrote:
> >> On Thu, Oct 3, 2024 at 4:32 PM Dan Williams <dan.j.williams@intel.com> wrote:
> >>>
> >>> Yang Shi wrote:
> >>>> On Mon, Sep 30, 2024 at 4:54 PM Huang, Ying <ying.huang@intel.com> wrote:
> >>>>>
> >>>>> Hi, David,
> >>>>>
> >>>>> Thanks a lot for comments!
> >>>>>
> >>>>> David Hildenbrand <david@redhat.com> writes:
> >>>>>
> >>>>>> On 30.09.24 07:51, Huang Ying wrote:
> >>>>>>> On systems with TDX (Trust Domain eXtensions) enabled, memory ranges
> >>>>>>> hot-added must be checked for compatibility by TDX.  This is currently
> >>>>>>> implemented through memory hotplug notifiers for each memory_block.
> >>>>>>> If a memory range which isn't TDX compatible is hot-added, for
> >>>>>>> example, some CXL memory, the command line as follows,
> >>>>>>>     $ echo 1 > /sys/devices/system/node/nodeX/memoryY/online
> >>>>>>> will report something like,
> >>>>>>>     bash: echo: write error: Operation not permitted
> >>>>>>> If pr_debug() is enabled, the error message like below will be shown
> >>>>>>> in the kernel log,
> >>>>>>>     online_pages [mem 0xXXXXXXXXXX-0xXXXXXXXXXX] failed
> >>>>>>> Both are too general to root cause the problem.  This will confuse
> >>>>>>> users.  One solution is to print some error messages in the TDX memory
> >>>>>>> hotplug notifier.  However, memory hotplug notifiers are called for
> >>>>>>> each memory block, so this may lead to a large volume of messages in
> >>>>>>> the kernel log if a large number of memory blocks are onlined with a
> >>>>>>> script or automatically.  For example, the typical size of memory
> >>>>>>> block is 128MB on x86_64, when online 64GB CXL memory, 512 messages
> >>>>>>> will be logged.
> >>>>>>
> >>>>>> ratelimiting would likely help here a lot, but I agree that it is
> >>>>>> suboptimal.
> >>>>>>
> >>>>>>> Therefore, in this patch, the whole hot-adding memory range is
> >>>>>>> checked
> >>>>>>> for TDX compatibility through a newly added architecture specific
> >>>>>>> function (arch_check_hotplug_memory_range()).  If rejected, the memory
> >>>>>>> hot-adding will be aborted with a proper kernel log message.  Which
> >>>>>>> looks like something as below,
> >>>>>>>     virt/tdx: Reject hot-adding memory range: 0xXXXXXXXX-0xXXXXXXXX
> >>>>>>> for TDX compatibility.
> >>>>>>>> The target use case is to support CXL memory on TDX enabled systems.
> >>>>>>> If the CXL memory isn't compatible with TDX, the whole CXL memory
> >>>>>>> range hot-adding will be rejected.  While the CXL memory can still be
> >>>>>>> used via devdax interface.
> >>>>>>
> >>>>>> I'm curious, why can that memory be used through devdax but not
> >>>>>> through the buddy? I'm probably missing something important :)
> >>>>>
> >>>>> Because only TDX compatible memory can be used for TDX guest.  The buddy
> >>>>> is used to allocate memory for TDX guest.  While devdax will not be used
> >>>>> for that.
> >>>>
> >>>> Sorry for chiming in late. I think CXL also faces the similar problem
> >>>> on the platform with MTE (memory tagging extension on ARM64). AFAIK,
> >>>> we can't have MTE on CXL, so CXL has to stay as dax device if MTE is
> >>>> enabled.
> >>>>
> >>>> We should need a similar mechanism to prevent users from hot-adding
> >>>> CXL memory if MTE is on. But not like TDX I don't think we have a
> >>>> simple way to tell whether the pfn belongs to CXL or not. Please
> >>>> correct me if I'm wrong. I'm wondering whether we can find a more
> >>>> common way to tell memory hotplug to not hot-add some region. For
> >>>> example, a special flag in struct resource. off the top of my head.
> >>>>
> >>>> No solid idea yet, I'm definitely seeking some advice.
> >>>
> >>> Could the ARM version of arch_check_hotplug_memory_range() check if MTE
> >>> is enabled in the CPU and then ask the CXL subsystem if the address range is
> >>> backed by a topology that supports MTE?
> >>
> >> Kernel can tell whether MTE is really enabled. For the CXL part, IIUC
> >> that relies on the CXL subsystem is able to tell whether that range
> >> can support MTE or not, right? Or CXL subsystem tells us whether the
> >> range is CXL memory range or not, then we can just refuse MTE for all
> >> CXL regions for now. Does CXL support this now?
> >
> > So the CXL specification has section:
> >
> >      8.2.4.31 CXL Extended Metadata Capability Register
> >
> > ...that indicates if the device supports "Extended Metadata" (EMD).
> > However, the CXL specification does not talk about how a given hosts
> > uses the extended metadata capabilities of a device. That detail would
> > need to come from an ARM platform specification.
> >
> > Currently CXL subsystem does nothing with this since there has been no
> > need to date, but I would expect someone from the ARM side to plumb this
> > detection into the CXL subsystem.
> >
> >>> However, why would it be ok to access CXL memory without MTE via devdax,
> >>> but not as online page allocator memory?
> >>
> >> CXL memory can be onlined as system ram as long as MTE is not enabled.
> >> It just can be used as devdax device if MTE is enabled.
> >
> > Do you mean the kernel only manages MTE for kernel pages, but with user
> > mapped memory the application will need to implicitly know that
> > memory-tagging is not available?
> >
> > I worry about applications that might not know that their heap is coming
> > from a userspace memory allocator backed by device-dax rather than the
> > kernel.
>
> I recall that MTE is requested by user space via mprotect(). If we end
> up with memory that is not taggable, we would have to fail the
> operation, which is not desirable.
>
> This is what we want to avoid, so if MTE is enabled, all memory in the
> buddy should be taggable.

Yes, the buddy memory has to be taggable if MTE is enabled. And not
only mprotect(), but also mmap() and malloc() (glibc compiled with MTE
support) can allocate mapping with MTE. And MTE mapping is just
allowed for anonymous and tmpfs currently.

>
> >
> >>> If the goal is to simply deny any and all non-MTE supported CXL region
> >>> from attaching then that could probably be handled as a modification to
> >>> the "cxl_acpi" driver to deny region creation unless it supports
> >>> everything the CPU expects from "memory".
> >>
> >> I'm not quite familiar with the details in CXL driver. What did you
> >> mean "deny region creation"? As long as the CXL memory still can be
> >> used as devdax device, it should be fine.
> >
> > Meaning that the CXL subsytem knows how to, for a given address range, figure
> > out the members and geometry of the CXL devices that contribute to that
> > range (CXL region). It would be straightforward to add EMD to that
> > enumeration and flag the CXL region as not online-capable if the CPU has
> > MTE enabled but no EMD capability.
>
> If it's really just CXL memory we are worrying about, we could pass a
> flag to add_memory_driver_managed(), and passing that to our callback here.
>
> Not sure if that is the most reliable way of handling it :) What about
> other ways of hotplugging memory besides CXL? Are we sure, they are/will
> be providing taggable memory?

AFAIK, I don't think they are, or at least some of them are not. So
this should be not CXL specific.

>
> --
> Cheers,
>
> David / dhildenb
>
James Morse Oct. 10, 2024, 5:52 p.m. UTC | #15
Hi guys,

On 04/10/2024 16:46, Yang Shi wrote:
> On Thu, Oct 3, 2024 at 8:15 PM Dan Williams <dan.j.williams@intel.com> wrote:
>> Yang Shi wrote:
>>> On Thu, Oct 3, 2024 at 4:32 PM Dan Williams <dan.j.williams@intel.com> wrote:
>>>> Yang Shi wrote:
>>>>> On Mon, Sep 30, 2024 at 4:54 PM Huang, Ying <ying.huang@intel.com> wrote:
>>>>>>
>>>>>> Hi, David,
>>>>>>
>>>>>> Thanks a lot for comments!
>>>>>>
>>>>>> David Hildenbrand <david@redhat.com> writes:
>>>>>>
>>>>>>> On 30.09.24 07:51, Huang Ying wrote:
>>>>>>>> On systems with TDX (Trust Domain eXtensions) enabled, memory ranges
>>>>>>>> hot-added must be checked for compatibility by TDX.  This is currently
>>>>>>>> implemented through memory hotplug notifiers for each memory_block.
>>>>>>>> If a memory range which isn't TDX compatible is hot-added, for
>>>>>>>> example, some CXL memory, the command line as follows,
>>>>>>>>    $ echo 1 > /sys/devices/system/node/nodeX/memoryY/online
>>>>>>>> will report something like,
>>>>>>>>    bash: echo: write error: Operation not permitted
>>>>>>>> If pr_debug() is enabled, the error message like below will be shown
>>>>>>>> in the kernel log,
>>>>>>>>    online_pages [mem 0xXXXXXXXXXX-0xXXXXXXXXXX] failed
>>>>>>>> Both are too general to root cause the problem.  This will confuse
>>>>>>>> users.  One solution is to print some error messages in the TDX memory
>>>>>>>> hotplug notifier.  However, memory hotplug notifiers are called for
>>>>>>>> each memory block, so this may lead to a large volume of messages in
>>>>>>>> the kernel log if a large number of memory blocks are onlined with a
>>>>>>>> script or automatically.  For example, the typical size of memory
>>>>>>>> block is 128MB on x86_64, when online 64GB CXL memory, 512 messages
>>>>>>>> will be logged.
>>>>>>>
>>>>>>> ratelimiting would likely help here a lot, but I agree that it is
>>>>>>> suboptimal.
>>>>>>>
>>>>>>>> Therefore, in this patch, the whole hot-adding memory range is
>>>>>>>> checked
>>>>>>>> for TDX compatibility through a newly added architecture specific
>>>>>>>> function (arch_check_hotplug_memory_range()).  If rejected, the memory
>>>>>>>> hot-adding will be aborted with a proper kernel log message.  Which
>>>>>>>> looks like something as below,
>>>>>>>>    virt/tdx: Reject hot-adding memory range: 0xXXXXXXXX-0xXXXXXXXX
>>>>>>>> for TDX compatibility.
>>>>>>>>> The target use case is to support CXL memory on TDX enabled systems.
>>>>>>>> If the CXL memory isn't compatible with TDX, the whole CXL memory
>>>>>>>> range hot-adding will be rejected.  While the CXL memory can still be
>>>>>>>> used via devdax interface.
>>>>>>>
>>>>>>> I'm curious, why can that memory be used through devdax but not
>>>>>>> through the buddy? I'm probably missing something important :)
>>>>>>
>>>>>> Because only TDX compatible memory can be used for TDX guest.  The buddy
>>>>>> is used to allocate memory for TDX guest.  While devdax will not be used
>>>>>> for that.
>>>>>
>>>>> Sorry for chiming in late. I think CXL also faces the similar problem
>>>>> on the platform with MTE (memory tagging extension on ARM64). AFAIK,
>>>>> we can't have MTE on CXL, so CXL has to stay as dax device if MTE is
>>>>> enabled.
>>>>>
>>>>> We should need a similar mechanism to prevent users from hot-adding
>>>>> CXL memory if MTE is on. But not like TDX I don't think we have a
>>>>> simple way to tell whether the pfn belongs to CXL or not. Please
>>>>> correct me if I'm wrong. I'm wondering whether we can find a more
>>>>> common way to tell memory hotplug to not hot-add some region. For
>>>>> example, a special flag in struct resource. off the top of my head.
>>>>>
>>>>> No solid idea yet, I'm definitely seeking some advice.
>>>>
>>>> Could the ARM version of arch_check_hotplug_memory_range() check if MTE
>>>> is enabled in the CPU and then ask the CXL subsystem if the address range is
>>>> backed by a topology that supports MTE?
>>>
>>> Kernel can tell whether MTE is really enabled. For the CXL part, IIUC
>>> that relies on the CXL subsystem is able to tell whether that range
>>> can support MTE or not, right? Or CXL subsystem tells us whether the
>>> range is CXL memory range or not, then we can just refuse MTE for all
>>> CXL regions for now. Does CXL support this now?
>>
>> So the CXL specification has section:
>>
>>     8.2.4.31 CXL Extended Metadata Capability Register
>>
>> ...that indicates if the device supports "Extended Metadata" (EMD).
>> However, the CXL specification does not talk about how a given hosts
>> uses the extended metadata capabilities of a device. That detail would
>> need to come from an ARM platform specification.
>>
>> Currently CXL subsystem does nothing with this since there has been no
>> need to date, but I would expect someone from the ARM side to plumb this
>> detection into the CXL subsystem.
> 
> Yeah, it should be a good way to let the kernel know whether CXL
> supports memory tagging or not.

On its own I don't think its enough - there would need to be some kind of capability in
both the CXL root-port and the device to say that MTE tags are sent in that metadata
field. If both support it, then the device memory supports MTE.

(I'll poke the standards people to see if this is something they already have in the
 works...)


>>>> However, why would it be ok to access CXL memory without MTE via devdax,
>>>> but not as online page allocator memory?

>>> CXL memory can be onlined as system ram as long as MTE is not enabled.
>>> It just can be used as devdax device if MTE is enabled.

This makes sense to me.

We can print a warning that 'arm64.nomte' should be passed on the command line if the CXL
memory is more important than MTE and the hardware can't support both.


>> Do you mean the kernel only manages MTE for kernel pages, but with user
>> mapped memory the application will need to implicitly know that
>> memory-tagging is not available?
> 
> I think the current assumption is that all buddy memory (can be used
> by userspace) should be taggable. And memory tagging is only supported
> for anonymous mapping and tmpfs. I'm adding hugetlbfs support. But any
> memory backed by the real backing store doesn't have memory tagging
> support.

Hopefully there are no assumptions here! -
Documentation/arch/arm64/memory-tagging-extension.rst says anonymous mappings can have
PROT_MTE set.

The arch code requires all memory to support MTE if the CPUs support it.


>> I worry about applications that might not know that their heap is coming
>> from a userspace memory allocator backed by device-dax rather than the
>> kernel.
> 
> IIUC, memory mapping from device-dax is a file mapping, right? If so,
> it is safe. If it is not, I think it is easy to handle. We can just
> reject any VM_MTE mapping from DAX.

That should already be the case. (we should check!)

Because devdax is already a file-mapping, user-space can't expect MTE to work.
While some library may not know the memory came from devdax - whoever wrote the
malloc()/free() implementation will have known they were using devdax - this is where the
decisions to use MTE and what tag to use is made.

I don't think this adds a new broken case.


>>>> If the goal is to simply deny any and all non-MTE supported CXL region
>>>> from attaching then that could probably be handled as a modification to
>>>> the "cxl_acpi" driver to deny region creation unless it supports
>>>> everything the CPU expects from "memory".
>>>
>>> I'm not quite familiar with the details in CXL driver. What did you
>>> mean "deny region creation"? As long as the CXL memory still can be
>>> used as devdax device, it should be fine.
>>
>> Meaning that the CXL subsytem knows how to, for a given address range, figure
>> out the members and geometry of the CXL devices that contribute to that
>> range (CXL region). It would be straightforward to add EMD to that
>> enumeration and flag the CXL region as not online-capable if the CPU has
>> MTE enabled but no EMD capability.
> 
> It sounds like a good way to me.

From your earlier description, EMD may not be enough - and this would depend on the
root-port (or at least the host side decoders) to support this too. I'll poke the spec
people...


Thanks,

James
Dan Williams Oct. 10, 2024, 6:28 p.m. UTC | #16
James Morse wrote:
[..]
> > Yeah, it should be a good way to let the kernel know whether CXL
> > supports memory tagging or not.
> 
> On its own I don't think its enough - there would need to be some kind of capability in
> both the CXL root-port and the device to say that MTE tags are sent in that metadata
> field. If both support it, then the device memory supports MTE.
> 
> (I'll poke the standards people to see if this is something they already have in the
>  works...)

If it helps, the question I would ask is "will the ACPI CFMWS (CXL Fixed
Memory Window Structure), grow a new 'Window Restrictions' bit
indicating the presence of EMD support, or will it be left to an ARM
specific enumeration outside of CFMWS?".

> >>>> However, why would it be ok to access CXL memory without MTE via devdax,
> >>>> but not as online page allocator memory?
> 
> >>> CXL memory can be onlined as system ram as long as MTE is not enabled.
> >>> It just can be used as devdax device if MTE is enabled.
> 
> This makes sense to me.
> 
> We can print a warning that 'arm64.nomte' should be passed on the command line if the CXL
> memory is more important than MTE and the hardware can't support both.
> 
> 
> >> Do you mean the kernel only manages MTE for kernel pages, but with user
> >> mapped memory the application will need to implicitly know that
> >> memory-tagging is not available?
> > 
> > I think the current assumption is that all buddy memory (can be used
> > by userspace) should be taggable. And memory tagging is only supported
> > for anonymous mapping and tmpfs. I'm adding hugetlbfs support. But any
> > memory backed by the real backing store doesn't have memory tagging
> > support.
> 
> Hopefully there are no assumptions here! -
> Documentation/arch/arm64/memory-tagging-extension.rst says anonymous mappings can have
> PROT_MTE set.
> 
> The arch code requires all memory to support MTE if the CPUs support it.
> 
> 
> >> I worry about applications that might not know that their heap is coming
> >> from a userspace memory allocator backed by device-dax rather than the
> >> kernel.
> > 
> > IIUC, memory mapping from device-dax is a file mapping, right? If so,
> > it is safe. If it is not, I think it is easy to handle. We can just
> > reject any VM_MTE mapping from DAX.
> 
> That should already be the case. (we should check!)
> 
> Because devdax is already a file-mapping, user-space can't expect MTE to work.
> While some library may not know the memory came from devdax - whoever wrote the
> malloc()/free() implementation will have known they were using devdax - this is where the
> decisions to use MTE and what tag to use is made.
> 
> I don't think this adds a new broken case.

Yeah, makes sense.

> >>>> If the goal is to simply deny any and all non-MTE supported CXL region
> >>>> from attaching then that could probably be handled as a modification to
> >>>> the "cxl_acpi" driver to deny region creation unless it supports
> >>>> everything the CPU expects from "memory".
> >>>
> >>> I'm not quite familiar with the details in CXL driver. What did you
> >>> mean "deny region creation"? As long as the CXL memory still can be
> >>> used as devdax device, it should be fine.
> >>
> >> Meaning that the CXL subsytem knows how to, for a given address range, figure
> >> out the members and geometry of the CXL devices that contribute to that
> >> range (CXL region). It would be straightforward to add EMD to that
> >> enumeration and flag the CXL region as not online-capable if the CPU has
> >> MTE enabled but no EMD capability.
> > 
> > It sounds like a good way to me.
> 
> From your earlier description, EMD may not be enough - and this would depend on the
> root-port (or at least the host side decoders) to support this too. I'll poke the spec
> people...

About the best CXL could do is indicate that the CXL window supports
EMD, but that is not sufficient for determining the arch capability for
MTE, so something tells me this might end up being an ARM specific (ACPI
or otherwise) enumeration to flag which if any CXL windows support MTE
regardless of EMD support.
Yang Shi Oct. 10, 2024, 7:46 p.m. UTC | #17
On Thu, Oct 10, 2024 at 10:52 AM James Morse <james.morse@arm.com> wrote:
>
> Hi guys,
>
> On 04/10/2024 16:46, Yang Shi wrote:
> > On Thu, Oct 3, 2024 at 8:15 PM Dan Williams <dan.j.williams@intel.com> wrote:
> >> Yang Shi wrote:
> >>> On Thu, Oct 3, 2024 at 4:32 PM Dan Williams <dan.j.williams@intel.com> wrote:
> >>>> Yang Shi wrote:
> >>>>> On Mon, Sep 30, 2024 at 4:54 PM Huang, Ying <ying.huang@intel.com> wrote:
> >>>>>>
> >>>>>> Hi, David,
> >>>>>>
> >>>>>> Thanks a lot for comments!
> >>>>>>
> >>>>>> David Hildenbrand <david@redhat.com> writes:
> >>>>>>
> >>>>>>> On 30.09.24 07:51, Huang Ying wrote:
> >>>>>>>> On systems with TDX (Trust Domain eXtensions) enabled, memory ranges
> >>>>>>>> hot-added must be checked for compatibility by TDX.  This is currently
> >>>>>>>> implemented through memory hotplug notifiers for each memory_block.
> >>>>>>>> If a memory range which isn't TDX compatible is hot-added, for
> >>>>>>>> example, some CXL memory, the command line as follows,
> >>>>>>>>    $ echo 1 > /sys/devices/system/node/nodeX/memoryY/online
> >>>>>>>> will report something like,
> >>>>>>>>    bash: echo: write error: Operation not permitted
> >>>>>>>> If pr_debug() is enabled, the error message like below will be shown
> >>>>>>>> in the kernel log,
> >>>>>>>>    online_pages [mem 0xXXXXXXXXXX-0xXXXXXXXXXX] failed
> >>>>>>>> Both are too general to root cause the problem.  This will confuse
> >>>>>>>> users.  One solution is to print some error messages in the TDX memory
> >>>>>>>> hotplug notifier.  However, memory hotplug notifiers are called for
> >>>>>>>> each memory block, so this may lead to a large volume of messages in
> >>>>>>>> the kernel log if a large number of memory blocks are onlined with a
> >>>>>>>> script or automatically.  For example, the typical size of memory
> >>>>>>>> block is 128MB on x86_64, when online 64GB CXL memory, 512 messages
> >>>>>>>> will be logged.
> >>>>>>>
> >>>>>>> ratelimiting would likely help here a lot, but I agree that it is
> >>>>>>> suboptimal.
> >>>>>>>
> >>>>>>>> Therefore, in this patch, the whole hot-adding memory range is
> >>>>>>>> checked
> >>>>>>>> for TDX compatibility through a newly added architecture specific
> >>>>>>>> function (arch_check_hotplug_memory_range()).  If rejected, the memory
> >>>>>>>> hot-adding will be aborted with a proper kernel log message.  Which
> >>>>>>>> looks like something as below,
> >>>>>>>>    virt/tdx: Reject hot-adding memory range: 0xXXXXXXXX-0xXXXXXXXX
> >>>>>>>> for TDX compatibility.
> >>>>>>>>> The target use case is to support CXL memory on TDX enabled systems.
> >>>>>>>> If the CXL memory isn't compatible with TDX, the whole CXL memory
> >>>>>>>> range hot-adding will be rejected.  While the CXL memory can still be
> >>>>>>>> used via devdax interface.
> >>>>>>>
> >>>>>>> I'm curious, why can that memory be used through devdax but not
> >>>>>>> through the buddy? I'm probably missing something important :)
> >>>>>>
> >>>>>> Because only TDX compatible memory can be used for TDX guest.  The buddy
> >>>>>> is used to allocate memory for TDX guest.  While devdax will not be used
> >>>>>> for that.
> >>>>>
> >>>>> Sorry for chiming in late. I think CXL also faces the similar problem
> >>>>> on the platform with MTE (memory tagging extension on ARM64). AFAIK,
> >>>>> we can't have MTE on CXL, so CXL has to stay as dax device if MTE is
> >>>>> enabled.
> >>>>>
> >>>>> We should need a similar mechanism to prevent users from hot-adding
> >>>>> CXL memory if MTE is on. But not like TDX I don't think we have a
> >>>>> simple way to tell whether the pfn belongs to CXL or not. Please
> >>>>> correct me if I'm wrong. I'm wondering whether we can find a more
> >>>>> common way to tell memory hotplug to not hot-add some region. For
> >>>>> example, a special flag in struct resource. off the top of my head.
> >>>>>
> >>>>> No solid idea yet, I'm definitely seeking some advice.
> >>>>
> >>>> Could the ARM version of arch_check_hotplug_memory_range() check if MTE
> >>>> is enabled in the CPU and then ask the CXL subsystem if the address range is
> >>>> backed by a topology that supports MTE?
> >>>
> >>> Kernel can tell whether MTE is really enabled. For the CXL part, IIUC
> >>> that relies on the CXL subsystem is able to tell whether that range
> >>> can support MTE or not, right? Or CXL subsystem tells us whether the
> >>> range is CXL memory range or not, then we can just refuse MTE for all
> >>> CXL regions for now. Does CXL support this now?
> >>
> >> So the CXL specification has section:
> >>
> >>     8.2.4.31 CXL Extended Metadata Capability Register
> >>
> >> ...that indicates if the device supports "Extended Metadata" (EMD).
> >> However, the CXL specification does not talk about how a given hosts
> >> uses the extended metadata capabilities of a device. That detail would
> >> need to come from an ARM platform specification.
> >>
> >> Currently CXL subsystem does nothing with this since there has been no
> >> need to date, but I would expect someone from the ARM side to plumb this
> >> detection into the CXL subsystem.
> >
> > Yeah, it should be a good way to let the kernel know whether CXL
> > supports memory tagging or not.
>
> On its own I don't think its enough - there would need to be some kind of capability in
> both the CXL root-port and the device to say that MTE tags are sent in that metadata
> field. If both support it, then the device memory supports MTE.

OK, we need both root port and device support for MTE. IOW if either
of them is false, we know MTE can't be supported, so we won't online
CXL memory as system ram.

>
> (I'll poke the standards people to see if this is something they already have in the
>  works...)
>
>
> >>>> However, why would it be ok to access CXL memory without MTE via devdax,
> >>>> but not as online page allocator memory?
>
> >>> CXL memory can be onlined as system ram as long as MTE is not enabled.
> >>> It just can be used as devdax device if MTE is enabled.
>
> This makes sense to me.
>
> We can print a warning that 'arm64.nomte' should be passed on the command line if the CXL
> memory is more important than MTE and the hardware can't support both.

Sounds good to me.

>
>
> >> Do you mean the kernel only manages MTE for kernel pages, but with user
> >> mapped memory the application will need to implicitly know that
> >> memory-tagging is not available?
> >
> > I think the current assumption is that all buddy memory (can be used
> > by userspace) should be taggable. And memory tagging is only supported
> > for anonymous mapping and tmpfs. I'm adding hugetlbfs support. But any
> > memory backed by the real backing store doesn't have memory tagging
> > support.
>
> Hopefully there are no assumptions here! -
> Documentation/arch/arm64/memory-tagging-extension.rst says anonymous mappings can have
> PROT_MTE set.
>
> The arch code requires all memory to support MTE if the CPUs support it.
>
>
> >> I worry about applications that might not know that their heap is coming
> >> from a userspace memory allocator backed by device-dax rather than the
> >> kernel.
> >
> > IIUC, memory mapping from device-dax is a file mapping, right? If so,
> > it is safe. If it is not, I think it is easy to handle. We can just
> > reject any VM_MTE mapping from DAX.
>
> That should already be the case. (we should check!)
>
> Because devdax is already a file-mapping, user-space can't expect MTE to work.
> While some library may not know the memory came from devdax - whoever wrote the
> malloc()/free() implementation will have known they were using devdax - this is where the
> decisions to use MTE and what tag to use is made.
>
> I don't think this adds a new broken case.

I agree.

>
>
> >>>> If the goal is to simply deny any and all non-MTE supported CXL region
> >>>> from attaching then that could probably be handled as a modification to
> >>>> the "cxl_acpi" driver to deny region creation unless it supports
> >>>> everything the CPU expects from "memory".
> >>>
> >>> I'm not quite familiar with the details in CXL driver. What did you
> >>> mean "deny region creation"? As long as the CXL memory still can be
> >>> used as devdax device, it should be fine.
> >>
> >> Meaning that the CXL subsytem knows how to, for a given address range, figure
> >> out the members and geometry of the CXL devices that contribute to that
> >> range (CXL region). It would be straightforward to add EMD to that
> >> enumeration and flag the CXL region as not online-capable if the CPU has
> >> MTE enabled but no EMD capability.
> >
> > It sounds like a good way to me.
>
> From your earlier description, EMD may not be enough - and this would depend on the
> root-port (or at least the host side decoders) to support this too. I'll poke the spec
> people...

Thank you for following up.

>
>
> Thanks,
>
> James
diff mbox series

Patch

diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index eba178996d84..6db5da34e4ba 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -116,11 +116,13 @@  static inline u64 sc_retry(sc_func_t func, u64 fn,
 int tdx_cpu_enable(void);
 int tdx_enable(void);
 const char *tdx_dump_mce_info(struct mce *m);
+int tdx_check_hotplug_memory_range(u64 start, u64 size);
 #else
 static inline void tdx_init(void) { }
 static inline int tdx_cpu_enable(void) { return -ENODEV; }
 static inline int tdx_enable(void)  { return -ENODEV; }
 static inline const char *tdx_dump_mce_info(struct mce *m) { return NULL; }
+static inline int tdx_check_hotplug_memory_range(u64 start, u64 size) { return 0; }
 #endif	/* CONFIG_INTEL_TDX_HOST */
 
 #endif /* !__ASSEMBLY__ */
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index ff253648706f..30a4ad4272ce 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -55,6 +55,7 @@ 
 #include <asm/uv/uv.h>
 #include <asm/setup.h>
 #include <asm/ftrace.h>
+#include <asm/tdx.h>
 
 #include "mm_internal.h"
 
@@ -974,6 +975,11 @@  int add_pages(int nid, unsigned long start_pfn, unsigned long nr_pages,
 	return ret;
 }
 
+int arch_check_hotplug_memory_range(u64 start, u64 size)
+{
+	return tdx_check_hotplug_memory_range(start, size);
+}
+
 int arch_add_memory(int nid, u64 start, u64 size,
 		    struct mhp_params *params)
 {
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index 4e2b2e2ac9f9..c477b04c5548 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -1388,36 +1388,32 @@  static bool is_tdx_memory(unsigned long start_pfn, unsigned long end_pfn)
 	return false;
 }
 
-static int tdx_memory_notifier(struct notifier_block *nb, unsigned long action,
-			       void *v)
+int tdx_check_hotplug_memory_range(u64 start, u64 size)
 {
-	struct memory_notify *mn = v;
-
-	if (action != MEM_GOING_ONLINE)
-		return NOTIFY_OK;
+	u64 start_pfn = PHYS_PFN(start);
+	u64 end_pfn = PHYS_PFN(start + size);
 
 	/*
 	 * Empty list means TDX isn't enabled.  Allow any memory
-	 * to go online.
+	 * to be hot-added.
 	 */
 	if (list_empty(&tdx_memlist))
-		return NOTIFY_OK;
+		return 0;
 
 	/*
 	 * The TDX memory configuration is static and can not be
-	 * changed.  Reject onlining any memory which is outside of
+	 * changed.  Reject hot-adding any memory which is outside of
 	 * the static configuration whether it supports TDX or not.
 	 */
-	if (is_tdx_memory(mn->start_pfn, mn->start_pfn + mn->nr_pages))
-		return NOTIFY_OK;
+	if (is_tdx_memory(start_pfn, end_pfn))
+		return 0;
 
-	return NOTIFY_BAD;
+	pr_info("Reject hot-adding memory range: %#llx-%#llx for TDX compatibility.\n",
+		start, start + size);
+
+	return -EINVAL;
 }
 
-static struct notifier_block tdx_memory_nb = {
-	.notifier_call = tdx_memory_notifier,
-};
-
 static void __init check_tdx_erratum(void)
 {
 	/*
@@ -1465,13 +1461,6 @@  void __init tdx_init(void)
 		return;
 	}
 
-	err = register_memory_notifier(&tdx_memory_nb);
-	if (err) {
-		pr_err("initialization failed: register_memory_notifier() failed (%d)\n",
-				err);
-		return;
-	}
-
 #if defined(CONFIG_ACPI) && defined(CONFIG_SUSPEND)
 	pr_info("Disable ACPI S3. Turn off TDX in the BIOS to use ACPI S3.\n");
 	acpi_suspend_lowlevel = NULL;
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index b27ddce5d324..c5ba7b909bb4 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -140,6 +140,9 @@  extern int try_online_node(int nid);
 
 extern int arch_add_memory(int nid, u64 start, u64 size,
 			   struct mhp_params *params);
+
+extern int arch_check_hotplug_memory_range(u64 start, u64 size);
+
 extern u64 max_mem_size;
 
 extern int mhp_online_type_from_str(const char *str);
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 621ae1015106..c4769f24b1e2 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1305,6 +1305,11 @@  int try_online_node(int nid)
 	return ret;
 }
 
+int __weak arch_check_hotplug_memory_range(u64 start, u64 size)
+{
+	return 0;
+}
+
 static int check_hotplug_memory_range(u64 start, u64 size)
 {
 	/* memory range must be block size aligned */
@@ -1315,7 +1320,7 @@  static int check_hotplug_memory_range(u64 start, u64 size)
 		return -EINVAL;
 	}
 
-	return 0;
+	return arch_check_hotplug_memory_range(start, size);
 }
 
 static int online_memory_block(struct memory_block *mem, void *arg)