Message ID | 20210810062626.1012-1-kirill.shutemov@linux.intel.com (mailing list archive)
---|---
Series | x86: Implement support for unaccepted memory
On 8/9/21 11:26 PM, Kirill A. Shutemov wrote:
> UEFI Specification version 2.9 introduces the concept of memory acceptance:
> Some Virtual Machine platforms, such as Intel TDX or AMD SEV-SNP,
> require memory to be accepted before it can be used by the guest.
> Accepting happens via a protocol specific to the Virtual Machine
> platform.
>
> Accepting memory is costly and it makes the VMM allocate memory for the
> accepted guest physical address range. We don't want to accept all memory
> upfront.

This could use a bit more explanation.  Any VM is likely to *eventually*
touch all its memory, so it's not like a VMM has a long-term advantage
by delaying this.

So, it must have to do with resource use at boot.  Is this to help boot
times?

I had expected this series, but I also expected it to be connected to
CONFIG_DEFERRED_STRUCT_PAGE_INIT somehow.  Could you explain a bit how
this problem is different and demands a totally orthogonal solution?

For instance, what prevents us from declaring: "Memory is accepted at
the time that its 'struct page' is initialized"?  Then, we use all the
infrastructure we already have for DEFERRED_STRUCT_PAGE_INIT.

This series isn't too onerous, but I do want to make sure that we're not
reinventing the wheel.
On Tue, Aug 10, 2021 at 07:08:58AM -0700, Dave Hansen wrote:
> On 8/9/21 11:26 PM, Kirill A. Shutemov wrote:
> > UEFI Specification version 2.9 introduces the concept of memory acceptance:
> > Some Virtual Machine platforms, such as Intel TDX or AMD SEV-SNP,
> > require memory to be accepted before it can be used by the guest.
> > Accepting happens via a protocol specific to the Virtual Machine
> > platform.
> >
> > Accepting memory is costly and it makes the VMM allocate memory for the
> > accepted guest physical address range. We don't want to accept all memory
> > upfront.
>
> This could use a bit more explanation.  Any VM is likely to *eventually*
> touch all its memory, so it's not like a VMM has a long-term advantage
> by delaying this.
>
> So, it must have to do with resource use at boot.  Is this to help boot
> times?

Yes, boot time is the main motivation.

But I'm also going to look at long-term VM behaviour with a fixed memory
footprint. I think if a workload allocates and frees memory within the
same amount, we can keep memory beyond that size unaccepted. A few tweaks
will likely be required, such as disabling page shuffling on free to keep
unaccepted memory at the tail of the free list. More investigation is
needed.

> I had expected this series, but I also expected it to be connected to
> CONFIG_DEFERRED_STRUCT_PAGE_INIT somehow.  Could you explain a bit how
> this problem is different and demands a totally orthogonal solution?
>
> For instance, what prevents us from declaring: "Memory is accepted at
> the time that its 'struct page' is initialized"?  Then, we use all the
> infrastructure we already have for DEFERRED_STRUCT_PAGE_INIT.

That was my first thought too, and I tried it just to realize that it is
not what we want. If we accepted pages at page struct init, it would mean
the host has to allocate all memory assigned to the guest on boot, even
if the guest actually uses only a small portion of it.

Also, deferred page init only allows scaling memory acceptance across
multiple CPUs, but doesn't allow getting to userspace before we are done
with it. See wait_for_completion(&pgdat_init_all_done_comp).
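
To make the contrast concrete, here is a minimal sketch of the two
approaches under discussion; the helper names (accept_memory(),
range_contains_unaccepted_memory()) are assumptions for illustration, not
code from this series:

#include <linux/mm.h>
#include <linux/pfn.h>
#include <linux/init.h>

/* Hypothetical helpers, for illustration only. */
void accept_memory(phys_addr_t start, phys_addr_t end);
bool range_contains_unaccepted_memory(phys_addr_t start, phys_addr_t end);

/*
 * "Accept at 'struct page' init" (the DEFERRED_STRUCT_PAGE_INIT idea):
 * the range is accepted while the memmap is initialized, so the host has
 * to back the whole guest address space at boot.
 */
static void __init init_and_accept_pages(unsigned long start_pfn,
					 unsigned long end_pfn)
{
	accept_memory(PFN_PHYS(start_pfn), PFN_PHYS(end_pfn));
	/* ... followed by the usual memmap init for each pfn ... */
}

/*
 * Lazy acceptance: a page is accepted only when it actually leaves the
 * free lists, so untouched memory stays unaccepted (and unallocated on
 * the host side).
 */
static void accept_page_on_alloc(struct page *page, unsigned int order)
{
	phys_addr_t start = PFN_PHYS(page_to_pfn(page));
	phys_addr_t end = start + (PAGE_SIZE << order);

	if (range_contains_unaccepted_memory(start, end))
		accept_memory(start, end);
}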
On 8/10/21 8:15 AM, Kirill A. Shutemov wrote:
> On Tue, Aug 10, 2021 at 07:08:58AM -0700, Dave Hansen wrote:
>> On 8/9/21 11:26 PM, Kirill A. Shutemov wrote:
>>> UEFI Specification version 2.9 introduces the concept of memory acceptance:
>>> Some Virtual Machine platforms, such as Intel TDX or AMD SEV-SNP,
>>> require memory to be accepted before it can be used by the guest.
>>> Accepting happens via a protocol specific to the Virtual Machine
>>> platform.
>>>
>>> Accepting memory is costly and it makes the VMM allocate memory for the
>>> accepted guest physical address range. We don't want to accept all memory
>>> upfront.
>>
>> This could use a bit more explanation.  Any VM is likely to *eventually*
>> touch all its memory, so it's not like a VMM has a long-term advantage
>> by delaying this.
>>
>> So, it must have to do with resource use at boot.  Is this to help boot
>> times?
>
> Yes, boot time is the main motivation.
>
> But I'm also going to look at long-term VM behaviour with a fixed memory
> footprint. I think if a workload allocates and frees memory within the
> same amount, we can keep memory beyond that size unaccepted. A few tweaks
> will likely be required, such as disabling page shuffling on free to keep
> unaccepted memory at the tail of the free list. More investigation is
> needed.

OK, so this is predicated on the idea that a guest will not use all of
its assigned RAM and that the host will put that RAM to good use
elsewhere.  Right?

That's undercut by a few factors:

1. Many real-world cloud providers do not overcommit RAM.  If the guest
   does not use the RAM, it goes to waste.  (Yes, there are providers
   that overcommit, but we're talking generally about places where this
   proposal is useful).
2. Long-term, RAM fills up with page cache in many scenarios.

So, this is really only beneficial for long-term host physical memory
use if:

1. The host is overcommitting and
2. The guest never uses all of its RAM

Seeing as TDX doesn't support swap and can't coexist with persistent
memory, the only recourse for folks overcommitting TDX VMs when they run
out of RAM is to kill VMs.  I can't imagine that a lot of folks are going
to do this.

In other words, I buy the boot speed argument.  But, I don't buy the
"this saves memory long term" argument at all.

>> I had expected this series, but I also expected it to be connected to
>> CONFIG_DEFERRED_STRUCT_PAGE_INIT somehow.  Could you explain a bit how
>> this problem is different and demands a totally orthogonal solution?
>>
>> For instance, what prevents us from declaring: "Memory is accepted at
>> the time that its 'struct page' is initialized"?  Then, we use all the
>> infrastructure we already have for DEFERRED_STRUCT_PAGE_INIT.
>
> That was my first thought too, and I tried it just to realize that it is
> not what we want. If we accepted pages at page struct init, it would mean
> the host has to allocate all memory assigned to the guest on boot, even
> if the guest actually uses only a small portion of it.
>
> Also, deferred page init only allows scaling memory acceptance across
> multiple CPUs, but doesn't allow getting to userspace before we are done
> with it. See wait_for_completion(&pgdat_init_all_done_comp).

That's good information.  It's a refinement of the "I want to boot
faster" requirement.  What you want is not just going _faster_, but
being able to run userspace before full acceptance has completed.

Would you be able to quantify how fast TDX page acceptance is?  Are we
talking about MB/s, GB/s, TB/s?  This series is rather bereft of numbers
for a feature which is making a performance claim.

Let's say we have a 128GB VM.  How much faster does this approach reach
userspace than if all memory was accepted up front?  How much memory
_could_ have been accepted at the point userspace starts running?
On Tue, Aug 10, 2021 at 08:51:01AM -0700, Dave Hansen wrote:
> In other words, I buy the boot speed argument.  But, I don't buy the
> "this saves memory long term" argument at all.

Okay, that's fair enough. I guess there are *some* workloads that may
have their memory footprint reduced, but I agree it's a minority.

> >> I had expected this series, but I also expected it to be connected to
> >> CONFIG_DEFERRED_STRUCT_PAGE_INIT somehow.  Could you explain a bit how
> >> this problem is different and demands a totally orthogonal solution?
> >>
> >> For instance, what prevents us from declaring: "Memory is accepted at
> >> the time that its 'struct page' is initialized"?  Then, we use all the
> >> infrastructure we already have for DEFERRED_STRUCT_PAGE_INIT.
> >
> > That was my first thought too, and I tried it just to realize that it is
> > not what we want. If we accepted pages at page struct init, it would mean
> > the host has to allocate all memory assigned to the guest on boot, even
> > if the guest actually uses only a small portion of it.
> >
> > Also, deferred page init only allows scaling memory acceptance across
> > multiple CPUs, but doesn't allow getting to userspace before we are done
> > with it. See wait_for_completion(&pgdat_init_all_done_comp).
>
> That's good information.  It's a refinement of the "I want to boot
> faster" requirement.  What you want is not just going _faster_, but
> being able to run userspace before full acceptance has completed.
>
> Would you be able to quantify how fast TDX page acceptance is?  Are we
> talking about MB/s, GB/s, TB/s?  This series is rather bereft of numbers
> for a feature which is making a performance claim.
>
> Let's say we have a 128GB VM.  How much faster does this approach reach
> userspace than if all memory was accepted up front?  How much memory
> _could_ have been accepted at the point userspace starts running?

Acceptance code is not optimized yet: we accept memory in 4k chunks,
which is very slow because hypercall overhead dominates the picture.

As of now, the kernel boot time of a 1-vCPU, 64TiB VM with upfront memory
acceptance is >20 times slower than with this lazy memory accept
approach.

The difference is going to be substantially lower once we get it
optimized properly.
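
As a rough back-of-the-envelope for why the 4k granularity hurts at this
scale (derived from the 64TiB figure above, assuming the cost is
dominated by a fixed per-call overhead):

    64 TiB / 4 KiB = 2^46 / 2^12 = 2^34 ~= 17 billion accept calls
    64 TiB / 2 MiB = 2^46 / 2^21 = 2^25 ~= 34 million accept calls (512x fewer)
    64 TiB / 1 GiB = 2^46 / 2^30 = 2^16  =  65536 accept calls

With a fixed per-call cost, the call count, not the per-page work, is
what dominates the upfront-acceptance time.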
On 8/10/21 10:31 AM, Kirill A. Shutemov wrote:
> On Tue, Aug 10, 2021 at 08:51:01AM -0700, Dave Hansen wrote:
>> Let's say we have a 128GB VM.  How much faster does this approach reach
>> userspace than if all memory was accepted up front?  How much memory
>> _could_ have been accepted at the point userspace starts running?
>
> Acceptance code is not optimized yet: we accept memory in 4k chunks,
> which is very slow because hypercall overhead dominates the picture.
>
> As of now, the kernel boot time of a 1-vCPU, 64TiB VM with upfront memory
> acceptance is >20 times slower than with this lazy memory accept
> approach.

That's a pretty big deal.

> The difference is going to be substantially lower once we get it
> optimized properly.

What does this mean?  Is this future work in the kernel or somewhere in
the TDX hardware/firmware which will speed things up?
On Tue, Aug 10, 2021 at 10:36:21AM -0700, Dave Hansen wrote:
> > The difference is going to be substantially lower once we get it
> > optimized properly.
>
> What does this mean?  Is this future work in the kernel or somewhere in
> the TDX hardware/firmware which will speed things up?

The kernel has to be changed to accept memory in 2M and 1G chunks where
possible. The interface exists and is described in the spec, but it is
not yet used in the guest kernel.

It would cut hypercall overhead dramatically. It makes upfront memory
acceptance more bearable and lowers the latency of lazy memory accept. So
I expect the gap to be not 20x, but more like 3-5x (which is still huge).
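
A minimal sketch of what "accept in 2M and 1G chunks where possible"
could look like on the guest side; the tdx_accept_page() helper and its
level encoding (0 = 4k, 1 = 2M, 2 = 1G) are assumptions here, not code
from the series:

#include <linux/types.h>
#include <linux/sizes.h>
#include <linux/kernel.h>

/* Hypothetical wrapper; returns 0 on success. Level: 0 = 4k, 1 = 2M, 2 = 1G. */
int tdx_accept_page(phys_addr_t gpa, int level);

/* Accept [start, end) using the largest aligned chunk at every step. */
static void accept_range(phys_addr_t start, phys_addr_t end)
{
	while (start < end) {
		if (IS_ALIGNED(start, SZ_1G) && end - start >= SZ_1G &&
		    !tdx_accept_page(start, 2)) {
			start += SZ_1G;
		} else if (IS_ALIGNED(start, SZ_2M) && end - start >= SZ_2M &&
			   !tdx_accept_page(start, 1)) {
			start += SZ_2M;
		} else {
			/* Fall back to 4k for unaligned or short tails. */
			tdx_accept_page(start, 0);
			start += SZ_4K;
		}
	}
}

Falling back from 1G to 2M to 4k keeps the call count low wherever the
range is large and well aligned, which is where the hypercall-overhead
savings come from.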
On 8/10/21 10:51 AM, Kirill A. Shutemov wrote:
> On Tue, Aug 10, 2021 at 10:36:21AM -0700, Dave Hansen wrote:
>>> The difference is going to be substantially lower once we get it
>>> optimized properly.
>> What does this mean?  Is this future work in the kernel or somewhere in
>> the TDX hardware/firmware which will speed things up?
> The kernel has to be changed to accept memory in 2M and 1G chunks where
> possible. The interface exists and is described in the spec, but it is
> not yet used in the guest kernel.

From a quick scan of the spec, I only see:

> 7.9.3. Page Acceptance by the Guest TD: TDG.MEM.PAGE.ACCEPT ... The guest
> TD can accept a dynamically added 4KB page using TDG.MEM.PAGE.ACCEPT
> with the page GPA as an input.

Is there some other 2M/1G page-acceptance call that I'm missing?

> It would cut hypercall overhead dramatically. It makes upfront memory
> acceptance more bearable and lowers the latency of lazy memory accept. So
> I expect the gap to be not 20x, but more like 3-5x (which is still huge).

It would be nice to be able to judge the benefits of this series based
on the final form.  I guess we'll take what we can get, though.

Either way, I'd still like to see some *actual* numbers for at least one
configuration:

	With this series applied, userspace starts to run at X seconds
	after kernel boot.  Without this series, userspace runs at Y
	seconds.
On Tue, Aug 10, 2021 at 11:19:30AM -0700, Dave Hansen wrote:
> On 8/10/21 10:51 AM, Kirill A. Shutemov wrote:
> > On Tue, Aug 10, 2021 at 10:36:21AM -0700, Dave Hansen wrote:
> >>> The difference is going to be substantially lower once we get it
> >>> optimized properly.
> >> What does this mean?  Is this future work in the kernel or somewhere in
> >> the TDX hardware/firmware which will speed things up?
> > The kernel has to be changed to accept memory in 2M and 1G chunks where
> > possible. The interface exists and is described in the spec, but it is
> > not yet used in the guest kernel.
>
> From a quick scan of the spec, I only see:
>
> > 7.9.3. Page Acceptance by the Guest TD: TDG.MEM.PAGE.ACCEPT ... The guest
> > TD can accept a dynamically added 4KB page using TDG.MEM.PAGE.ACCEPT
> > with the page GPA as an input.
>
> Is there some other 2M/1G page-acceptance call that I'm missing?

I referred to GHCI[1], section 2.4.7. RDX=0 is 4k, RDX=1 is 2M and RDX=2
is 1G.

Public specs have mismatches. I hope it will get sorted out soon. :/

[1] https://software.intel.com/content/dam/develop/external/us/en/documents/intel-tdx-guest-hypervisor-communication-interface.pdf

> > It would cut hypercall overhead dramatically. It makes upfront memory
> > acceptance more bearable and lowers the latency of lazy memory accept. So
> > I expect the gap to be not 20x, but more like 3-5x (which is still huge).
>
> It would be nice to be able to judge the benefits of this series based
> on the final form.  I guess we'll take what we can get, though.
>
> Either way, I'd still like to see some *actual* numbers for at least one
> configuration:
>
> 	With this series applied, userspace starts to run at X seconds
> 	after kernel boot.  Without this series, userspace runs at Y
> 	seconds.

Getting absolute numbers in public for an unreleased product is tricky.
I hoped to get away with ratios or percentages of difference.
Hi Kirill,

On Tue, Aug 10, 2021 at 09:26:21AM +0300, Kirill A. Shutemov wrote:
> Accepting happens via a protocol specific to the Virtual Machine
> platform.

That sentence bothers me a bit. Can you explain what is VMM specific in
the acceptance protocol? I want to avoid having to implement VMM specific
acceptance protocols.

Regards,

	Joerg
On Thu, Aug 12, 2021 at 10:23:24AM +0200, Joerg Roedel wrote:
> Hi Kirill,
>
> On Tue, Aug 10, 2021 at 09:26:21AM +0300, Kirill A. Shutemov wrote:
> > Accepting happens via a protocol specific to the Virtual Machine
> > platform.
>
> That sentence bothers me a bit. Can you explain what is VMM specific in
> the acceptance protocol? I want to avoid having to implement VMM specific
> acceptance protocols.

For TDX we have a single MapGPA hypercall to the VMM plus a TDAcceptPage
call to the TDX Module for every accepted page. SEV-SNP has to do
something similar.
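
Roughly, the TDX flow described above would look like the sketch below;
tdx_map_gpa() and tdx_accept_page() are illustrative wrappers (assumed,
not the series' actual API) around the GHCI MapGPA hypercall and the
per-page accept call (TDG.MEM.PAGE.ACCEPT) into the TDX Module:

#include <linux/types.h>
#include <linux/mm.h>

/* Hypothetical wrappers, for illustration only. */
int tdx_map_gpa(phys_addr_t start, phys_addr_t end, bool private);
int tdx_accept_page(phys_addr_t gpa, int level);

/*
 * One MapGPA call to the VMM for the whole range, then per-page
 * acceptance against the TDX Module.
 */
static int tdx_accept_memory_range(phys_addr_t start, phys_addr_t end)
{
	phys_addr_t gpa;
	int ret;

	ret = tdx_map_gpa(start, end, true);	/* mark the range private */
	if (ret)
		return ret;

	for (gpa = start; gpa < end; gpa += PAGE_SIZE) {
		ret = tdx_accept_page(gpa, 0);	/* 4k granularity */
		if (ret)
			return ret;
	}
	return 0;
}

An SEV-SNP guest would need an analogous but different pair of
operations, which is where the per-platform differences come in.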
On 8/12/2021 3:10 AM, Kirill A. Shutemov wrote:
> On Thu, Aug 12, 2021 at 10:23:24AM +0200, Joerg Roedel wrote:
>> Hi Kirill,
>>
>> On Tue, Aug 10, 2021 at 09:26:21AM +0300, Kirill A. Shutemov wrote:
>>> Accepting happens via a protocol specific to the Virtual Machine
>>> platform.
>> That sentence bothers me a bit. Can you explain what is VMM specific in
>> the acceptance protocol?
> For TDX we have a single MapGPA hypercall to the VMM plus a TDAcceptPage
> call to the TDX Module for every accepted page. SEV-SNP has to do
> something similar.

I think Joerg's question was if TDX has a single ABI for all hypervisors.
The GHCI specification supports both hypervisor specific and hypervisor
agnostic calls. But these basic operations like MapGPA are all hypervisor
agnostic. The only differences would be in the existing hypervisor
specific PV code.

-Andi
On Thu, Aug 12, 2021 at 12:33:11PM -0700, Andi Kleen wrote:
> On 8/12/2021 3:10 AM, Kirill A. Shutemov wrote:
> > On Thu, Aug 12, 2021 at 10:23:24AM +0200, Joerg Roedel wrote:
> > > Hi Kirill,
> > >
> > > On Tue, Aug 10, 2021 at 09:26:21AM +0300, Kirill A. Shutemov wrote:
> > > > Accepting happens via a protocol specific to the Virtual Machine
> > > > platform.
> > > That sentence bothers me a bit. Can you explain what is VMM specific in
> > > the acceptance protocol?
> > For TDX we have a single MapGPA hypercall to the VMM plus a TDAcceptPage
> > call to the TDX Module for every accepted page. SEV-SNP has to do
> > something similar.
>
> I think Joerg's question was if TDX has a single ABI for all hypervisors.
> The GHCI specification supports both hypervisor specific and hypervisor
> agnostic calls. But these basic operations like MapGPA are all hypervisor
> agnostic. The only differences would be in the existing hypervisor
> specific PV code.

My point was that TDX and SEV-SNP are going to be different and we need a
way to hook the right protocol for each.
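
One possible shape for such a hook, purely as an illustration (the ops
structure and function names below are assumptions, not what the series
implements):

#include <linux/types.h>
#include <linux/init.h>

/* Platform implementations, assumed to be provided by the TDX and
 * SEV-SNP guest code respectively. */
void tdx_accept_memory(phys_addr_t start, phys_addr_t end);
void snp_accept_memory(phys_addr_t start, phys_addr_t end);

struct memory_accept_ops {
	void (*accept_memory)(phys_addr_t start, phys_addr_t end);
};

static const struct memory_accept_ops tdx_accept_ops = {
	.accept_memory = tdx_accept_memory,
};

static const struct memory_accept_ops snp_accept_ops = {
	.accept_memory = snp_accept_memory,
};

static const struct memory_accept_ops *accept_ops;

/* Selected once during early boot, based on the detected platform. */
void __init accept_memory_init(bool is_tdx)
{
	accept_ops = is_tdx ? &tdx_accept_ops : &snp_accept_ops;
}

/* Generic entry point used by the rest of the kernel. */
void accept_memory(phys_addr_t start, phys_addr_t end)
{
	if (accept_ops && accept_ops->accept_memory)
		accept_ops->accept_memory(start, end);
}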
On Thu, Aug 12, 2021 at 11:22:51PM +0300, Kirill A. Shutemov wrote:
> On Thu, Aug 12, 2021 at 12:33:11PM -0700, Andi Kleen wrote:
> > I think Joerg's question was if TDX has a single ABI for all hypervisors.
> > The GHCI specification supports both hypervisor specific and hypervisor
> > agnostic calls. But these basic operations like MapGPA are all hypervisor
> > agnostic. The only differences would be in the existing hypervisor
> > specific PV code.
>
> My point was that TDX and SEV-SNP are going to be different and we need a
> way to hook the right protocol for each.

Yeah, okay, thanks for the clarification. My misunderstanding was that
there could be a per-hypervisor contract on what memory is pre-accepted
and what Linux is responsible for.

Thanks,

	Joerg