Message ID | 20210810062626.1012-1-kirill.shutemov@linux.intel.com (mailing list archive)
---|---
Series | x86: Implement support for unaccepted memory
On 8/9/21 11:26 PM, Kirill A. Shutemov wrote:
> UEFI Specification version 2.9 introduces the concept of memory acceptance:
> Some Virtual Machine platforms, such as Intel TDX or AMD SEV-SNP,
> require memory to be accepted before it can be used by the guest.
> Accepting happens via a protocol specific to the Virtual Machine
> platform.
>
> Accepting memory is costly and it makes the VMM allocate memory for the
> accepted guest physical address range. We don't want to accept all memory
> upfront.

This could use a bit more explanation.  Any VM is likely to *eventually*
touch all its memory, so it's not like a VMM has a long-term advantage
by delaying this.

So, it must have to do with resource use at boot.  Is this to help boot
times?

I had expected this series, but I also expected it to be connected to
CONFIG_DEFERRED_STRUCT_PAGE_INIT somehow.  Could you explain a bit how
this problem is different and demands a totally orthogonal solution?

For instance, what prevents us from declaring: "Memory is accepted at
the time that its 'struct page' is initialized"?  Then, we use all the
infrastructure we already have for DEFERRED_STRUCT_PAGE_INIT.

This series isn't too onerous, but I do want to make sure that we're not
reinventing the wheel.
On Tue, Aug 10, 2021 at 07:08:58AM -0700, Dave Hansen wrote:
> On 8/9/21 11:26 PM, Kirill A. Shutemov wrote:
> > UEFI Specification version 2.9 introduces the concept of memory acceptance:
> > Some Virtual Machine platforms, such as Intel TDX or AMD SEV-SNP,
> > require memory to be accepted before it can be used by the guest.
> > Accepting happens via a protocol specific to the Virtual Machine
> > platform.
> >
> > Accepting memory is costly and it makes the VMM allocate memory for the
> > accepted guest physical address range. We don't want to accept all memory
> > upfront.
>
> This could use a bit more explanation.  Any VM is likely to *eventually*
> touch all its memory, so it's not like a VMM has a long-term advantage
> by delaying this.
>
> So, it must have to do with resource use at boot.  Is this to help boot
> times?

Yes, boot time is the main motivation.

But I'm also going to look at long-term VM behaviour with a fixed memory
footprint. I think if a workload allocates and frees memory within the
same amount, we can keep memory beyond that size unaccepted. A few tweaks
will likely be required, such as disabling page shuffling on free to keep
unaccepted memory at the tail of the free list. More investigation is
needed.

> I had expected this series, but I also expected it to be connected to
> CONFIG_DEFERRED_STRUCT_PAGE_INIT somehow.  Could you explain a bit how
> this problem is different and demands a totally orthogonal solution?
>
> For instance, what prevents us from declaring: "Memory is accepted at
> the time that its 'struct page' is initialized"?  Then, we use all the
> infrastructure we already have for DEFERRED_STRUCT_PAGE_INIT.

That was my first thought too, and I tried it just to realize that it is
not what we want. If we accepted pages at page struct init, it would mean
the host has to allocate all memory assigned to the guest on boot, even
if the guest actually uses only a small portion of it.

Also, deferred page init only allows scaling memory acceptance across
multiple CPUs, but doesn't allow getting to userspace before we are done
with it. See wait_for_completion(&pgdat_init_all_done_comp).
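
To make the contrast concrete, here is a minimal sketch of the two
approaches under discussion; the helper names (accept_memory(),
range_contains_unaccepted_memory()) are assumptions for illustration, not
code from this series:

#include <linux/mm.h>
#include <linux/pfn.h>
#include <linux/init.h>

/* Hypothetical helpers, for illustration only. */
void accept_memory(phys_addr_t start, phys_addr_t end);
bool range_contains_unaccepted_memory(phys_addr_t start, phys_addr_t end);

/*
 * "Accept at 'struct page' init" (the DEFERRED_STRUCT_PAGE_INIT idea):
 * the range is accepted while the memmap is initialized, so the host has
 * to back the whole guest address space at boot.
 */
static void __init init_and_accept_pages(unsigned long start_pfn,
					 unsigned long end_pfn)
{
	accept_memory(PFN_PHYS(start_pfn), PFN_PHYS(end_pfn));
	/* ... followed by the usual memmap init for each pfn ... */
}

/*
 * Lazy acceptance: a page is accepted only when it actually leaves the
 * free lists, so untouched memory stays unaccepted (and unallocated on
 * the host side).
 */
static void accept_page_on_alloc(struct page *page, unsigned int order)
{
	phys_addr_t start = PFN_PHYS(page_to_pfn(page));
	phys_addr_t end = start + (PAGE_SIZE << order);

	if (range_contains_unaccepted_memory(start, end))
		accept_memory(start, end);
}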
On 8/10/21 8:15 AM, Kirill A. Shutemov wrote:
> On Tue, Aug 10, 2021 at 07:08:58AM -0700, Dave Hansen wrote:
>> On 8/9/21 11:26 PM, Kirill A. Shutemov wrote:
>>> UEFI Specification version 2.9 introduces the concept of memory acceptance:
>>> Some Virtual Machine platforms, such as Intel TDX or AMD SEV-SNP,
>>> require memory to be accepted before it can be used by the guest.
>>> Accepting happens via a protocol specific to the Virtual Machine
>>> platform.
>>>
>>> Accepting memory is costly and it makes the VMM allocate memory for the
>>> accepted guest physical address range. We don't want to accept all memory
>>> upfront.
>>
>> This could use a bit more explanation.  Any VM is likely to *eventually*
>> touch all its memory, so it's not like a VMM has a long-term advantage
>> by delaying this.
>>
>> So, it must have to do with resource use at boot.  Is this to help boot
>> times?
>
> Yes, boot time is the main motivation.
>
> But I'm also going to look at long-term VM behaviour with a fixed memory
> footprint. I think if a workload allocates and frees memory within the
> same amount, we can keep memory beyond that size unaccepted. A few tweaks
> will likely be required, such as disabling page shuffling on free to keep
> unaccepted memory at the tail of the free list. More investigation is
> needed.

OK, so this is predicated on the idea that a guest will not use all of
its assigned RAM and that the host will put that RAM to good use
elsewhere.  Right?

That's undercut by a few factors:

1. Many real-world cloud providers do not overcommit RAM.  If the guest
   does not use the RAM, it goes to waste.  (Yes, there are providers
   that overcommit, but we're talking generally about places where this
   proposal is useful).
2. Long-term, RAM fills up with page cache in many scenarios.

So, this is really only beneficial for long-term host physical memory
use if:

1. The host is overcommitting and
2. The guest never uses all of its RAM

Seeing as TDX doesn't support swap and can't coexist with persistent
memory, the only recourse for folks overcommitting TDX VMs when they run
out of RAM is to kill VMs.  I can't imagine that a lot of folks are going
to do this.

In other words, I buy the boot speed argument.  But, I don't buy the
"this saves memory long term" argument at all.

>> I had expected this series, but I also expected it to be connected to
>> CONFIG_DEFERRED_STRUCT_PAGE_INIT somehow.  Could you explain a bit how
>> this problem is different and demands a totally orthogonal solution?
>>
>> For instance, what prevents us from declaring: "Memory is accepted at
>> the time that its 'struct page' is initialized"?  Then, we use all the
>> infrastructure we already have for DEFERRED_STRUCT_PAGE_INIT.
>
> That was my first thought too, and I tried it just to realize that it is
> not what we want. If we accepted pages at page struct init, it would mean
> the host has to allocate all memory assigned to the guest on boot, even
> if the guest actually uses only a small portion of it.
>
> Also, deferred page init only allows scaling memory acceptance across
> multiple CPUs, but doesn't allow getting to userspace before we are done
> with it. See wait_for_completion(&pgdat_init_all_done_comp).

That's good information.  It's a refinement of the "I want to boot
faster" requirement.  What you want is not just going _faster_, but
being able to run userspace before full acceptance has completed.

Would you be able to quantify how fast TDX page acceptance is?  Are we
talking about MB/s, GB/s, TB/s?  This series is rather bereft of numbers
for a feature which is making a performance claim.

Let's say we have a 128GB VM.  How much faster does this approach reach
userspace than if all memory was accepted up front?  How much memory
_could_ have been accepted at the point userspace starts running?
On Tue, Aug 10, 2021 at 08:51:01AM -0700, Dave Hansen wrote:
> In other words, I buy the boot speed argument.  But, I don't buy the
> "this saves memory long term" argument at all.

Okay, that's fair enough. I guess there are *some* workloads that may
have their memory footprint reduced, but I agree it's a minority.

> >> I had expected this series, but I also expected it to be connected to
> >> CONFIG_DEFERRED_STRUCT_PAGE_INIT somehow.  Could you explain a bit how
> >> this problem is different and demands a totally orthogonal solution?
> >>
> >> For instance, what prevents us from declaring: "Memory is accepted at
> >> the time that its 'struct page' is initialized"?  Then, we use all the
> >> infrastructure we already have for DEFERRED_STRUCT_PAGE_INIT.
> >
> > That was my first thought too, and I tried it just to realize that it is
> > not what we want. If we accepted pages at page struct init, it would mean
> > the host has to allocate all memory assigned to the guest on boot, even
> > if the guest actually uses only a small portion of it.
> >
> > Also, deferred page init only allows scaling memory acceptance across
> > multiple CPUs, but doesn't allow getting to userspace before we are done
> > with it. See wait_for_completion(&pgdat_init_all_done_comp).
>
> That's good information.  It's a refinement of the "I want to boot
> faster" requirement.  What you want is not just going _faster_, but
> being able to run userspace before full acceptance has completed.
>
> Would you be able to quantify how fast TDX page acceptance is?  Are we
> talking about MB/s, GB/s, TB/s?  This series is rather bereft of numbers
> for a feature which is making a performance claim.
>
> Let's say we have a 128GB VM.  How much faster does this approach reach
> userspace than if all memory was accepted up front?  How much memory
> _could_ have been accepted at the point userspace starts running?

Acceptance code is not optimized yet: we accept memory in 4k chunks,
which is very slow because hypercall overhead dominates the picture.

As of now, the kernel boot time of a 1-vCPU, 64TiB VM with upfront memory
acceptance is >20 times slower than with this lazy memory accept
approach.

The difference is going to be substantially lower once we get it
optimized properly.
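
As a rough back-of-the-envelope for why the 4k granularity hurts at this
scale (derived from the 64TiB figure above, assuming the cost is
dominated by a fixed per-call overhead):

    64 TiB / 4 KiB = 2^46 / 2^12 = 2^34 ~= 17 billion accept calls
    64 TiB / 2 MiB = 2^46 / 2^21 = 2^25 ~= 34 million accept calls (512x fewer)
    64 TiB / 1 GiB = 2^46 / 2^30 = 2^16  =  65536 accept calls

With a fixed per-call cost, the call count, not the per-page work, is
what dominates the upfront-acceptance time.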
On 8/10/21 10:31 AM, Kirill A. Shutemov wrote:
> On Tue, Aug 10, 2021 at 08:51:01AM -0700, Dave Hansen wrote:
>> Let's say we have a 128GB VM.  How much faster does this approach reach
>> userspace than if all memory was accepted up front?  How much memory
>> _could_ have been accepted at the point userspace starts running?
>
> Acceptance code is not optimized yet: we accept memory in 4k chunks,
> which is very slow because hypercall overhead dominates the picture.
>
> As of now, the kernel boot time of a 1-vCPU, 64TiB VM with upfront memory
> acceptance is >20 times slower than with this lazy memory accept
> approach.

That's a pretty big deal.

> The difference is going to be substantially lower once we get it
> optimized properly.

What does this mean?  Is this future work in the kernel or somewhere in
the TDX hardware/firmware which will speed things up?
On Tue, Aug 10, 2021 at 10:36:21AM -0700, Dave Hansen wrote:
> > The difference is going to be substantially lower once we get it
> > optimized properly.
>
> What does this mean?  Is this future work in the kernel or somewhere in
> the TDX hardware/firmware which will speed things up?

The kernel has to be changed to accept memory in 2M and 1G chunks where
possible. The interface exists and is described in the spec, but it is
not yet used in the guest kernel.

It would cut hypercall overhead dramatically. It makes upfront memory
acceptance more bearable and lowers the latency of lazy memory accept. So
I expect the gap to be not 20x, but more like 3-5x (which is still huge).
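
A minimal sketch of what "accept in 2M and 1G chunks where possible"
could look like on the guest side; the tdx_accept_page() helper and its
level encoding (0 = 4k, 1 = 2M, 2 = 1G) are assumptions here, not code
from the series:

#include <linux/types.h>
#include <linux/sizes.h>
#include <linux/kernel.h>

/* Hypothetical wrapper; returns 0 on success. Level: 0 = 4k, 1 = 2M, 2 = 1G. */
int tdx_accept_page(phys_addr_t gpa, int level);

/* Accept [start, end) using the largest aligned chunk at every step. */
static void accept_range(phys_addr_t start, phys_addr_t end)
{
	while (start < end) {
		if (IS_ALIGNED(start, SZ_1G) && end - start >= SZ_1G &&
		    !tdx_accept_page(start, 2)) {
			start += SZ_1G;
		} else if (IS_ALIGNED(start, SZ_2M) && end - start >= SZ_2M &&
			   !tdx_accept_page(start, 1)) {
			start += SZ_2M;
		} else {
			/* Fall back to 4k for unaligned or short tails. */
			tdx_accept_page(start, 0);
			start += SZ_4K;
		}
	}
}

Falling back from 1G to 2M to 4k keeps the call count low wherever the
range is large and well aligned, which is where the hypercall-overhead
savings come from.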
On 8/10/21 10:51 AM, Kirill A. Shutemov wrote:
> On Tue, Aug 10, 2021 at 10:36:21AM -0700, Dave Hansen wrote:
>>> The difference is going to be substantially lower once we get it
>>> optimized properly.
>> What does this mean?  Is this future work in the kernel or somewhere in
>> the TDX hardware/firmware which will speed things up?
> The kernel has to be changed to accept memory in 2M and 1G chunks where
> possible. The interface exists and is described in the spec, but it is
> not yet used in the guest kernel.

From a quick scan of the spec, I only see:

> 7.9.3. Page Acceptance by the Guest TD: TDG.MEM.PAGE.ACCEPT ... The guest
> TD can accept a dynamically added 4KB page using TDG.MEM.PAGE.ACCEPT
> with the page GPA as an input.

Is there some other 2M/1G page-acceptance call that I'm missing?

> It would cut hypercall overhead dramatically. It makes upfront memory
> acceptance more bearable and lowers the latency of lazy memory accept. So
> I expect the gap to be not 20x, but more like 3-5x (which is still huge).

It would be nice to be able to judge the benefits of this series based
on the final form.  I guess we'll take what we can get, though.

Either way, I'd still like to see some *actual* numbers for at least one
configuration:

	With this series applied, userspace starts to run at X seconds
	after kernel boot.  Without this series, userspace runs at Y
	seconds.
On Tue, Aug 10, 2021 at 11:19:30AM -0700, Dave Hansen wrote:
> On 8/10/21 10:51 AM, Kirill A. Shutemov wrote:
> > On Tue, Aug 10, 2021 at 10:36:21AM -0700, Dave Hansen wrote:
> >>> The difference is going to be substantially lower once we get it
> >>> optimized properly.
> >> What does this mean?  Is this future work in the kernel or somewhere in
> >> the TDX hardware/firmware which will speed things up?
> > The kernel has to be changed to accept memory in 2M and 1G chunks where
> > possible. The interface exists and is described in the spec, but it is
> > not yet used in the guest kernel.
>
> From a quick scan of the spec, I only see:
>
> > 7.9.3. Page Acceptance by the Guest TD: TDG.MEM.PAGE.ACCEPT ... The guest
> > TD can accept a dynamically added 4KB page using TDG.MEM.PAGE.ACCEPT
> > with the page GPA as an input.
>
> Is there some other 2M/1G page-acceptance call that I'm missing?

I referred to GHCI[1], section 2.4.7. RDX=0 is 4k, RDX=1 is 2M and RDX=2
is 1G.

Public specs have mismatches. I hope it will get sorted out soon. :/

[1] https://software.intel.com/content/dam/develop/external/us/en/documents/intel-tdx-guest-hypervisor-communication-interface.pdf

> > It would cut hypercall overhead dramatically. It makes upfront memory
> > acceptance more bearable and lowers the latency of lazy memory accept. So
> > I expect the gap to be not 20x, but more like 3-5x (which is still huge).
>
> It would be nice to be able to judge the benefits of this series based
> on the final form.  I guess we'll take what we can get, though.
>
> Either way, I'd still like to see some *actual* numbers for at least one
> configuration:
>
> 	With this series applied, userspace starts to run at X seconds
> 	after kernel boot.  Without this series, userspace runs at Y
> 	seconds.

Getting absolute numbers in public for an unreleased product is tricky.
I hoped to get away with ratios or percentages of difference.
Hi Kirill,

On Tue, Aug 10, 2021 at 09:26:21AM +0300, Kirill A. Shutemov wrote:
> Accepting happens via a protocol specific to the Virtual Machine
> platform.

That sentence bothers me a bit. Can you explain what is VMM specific in
the acceptance protocol? I want to avoid having to implement VMM specific
acceptance protocols.

Regards,

	Joerg
On Thu, Aug 12, 2021 at 10:23:24AM +0200, Joerg Roedel wrote:
> Hi Kirill,
>
> On Tue, Aug 10, 2021 at 09:26:21AM +0300, Kirill A. Shutemov wrote:
> > Accepting happens via a protocol specific to the Virtual Machine
> > platform.
>
> That sentence bothers me a bit. Can you explain what is VMM specific in
> the acceptance protocol? I want to avoid having to implement VMM specific
> acceptance protocols.

For TDX we have a single MapGPA hypercall to the VMM plus a TDAcceptPage
call to the TDX Module for every accepted page. SEV-SNP has to do
something similar.
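
Roughly, the TDX flow described above would look like the sketch below;
tdx_map_gpa() and tdx_accept_page() are illustrative wrappers (assumed,
not the series' actual API) around the GHCI MapGPA hypercall and the
per-page accept call (TDG.MEM.PAGE.ACCEPT) into the TDX Module:

#include <linux/types.h>
#include <linux/mm.h>

/* Hypothetical wrappers, for illustration only. */
int tdx_map_gpa(phys_addr_t start, phys_addr_t end, bool private);
int tdx_accept_page(phys_addr_t gpa, int level);

/*
 * One MapGPA call to the VMM for the whole range, then per-page
 * acceptance against the TDX Module.
 */
static int tdx_accept_memory_range(phys_addr_t start, phys_addr_t end)
{
	phys_addr_t gpa;
	int ret;

	ret = tdx_map_gpa(start, end, true);	/* mark the range private */
	if (ret)
		return ret;

	for (gpa = start; gpa < end; gpa += PAGE_SIZE) {
		ret = tdx_accept_page(gpa, 0);	/* 4k granularity */
		if (ret)
			return ret;
	}
	return 0;
}

An SEV-SNP guest would need an analogous but different pair of
operations, which is where the per-platform differences come in.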
On 8/12/2021 3:10 AM, Kirill A. Shutemov wrote:
> On Thu, Aug 12, 2021 at 10:23:24AM +0200, Joerg Roedel wrote:
>> Hi Kirill,
>>
>> On Tue, Aug 10, 2021 at 09:26:21AM +0300, Kirill A. Shutemov wrote:
>>> Accepting happens via a protocol specific to the Virtual Machine
>>> platform.
>> That sentence bothers me a bit. Can you explain what is VMM specific in
>> the acceptance protocol?
> For TDX we have a single MapGPA hypercall to the VMM plus a TDAcceptPage
> call to the TDX Module for every accepted page. SEV-SNP has to do
> something similar.

I think Joerg's question was if TDX has a single ABI for all hypervisors.
The GHCI specification supports both hypervisor specific and hypervisor
agnostic calls. But these basic operations like MapGPA are all hypervisor
agnostic. The only differences would be in the existing hypervisor
specific PV code.

-Andi
On Thu, Aug 12, 2021 at 12:33:11PM -0700, Andi Kleen wrote:
> On 8/12/2021 3:10 AM, Kirill A. Shutemov wrote:
> > On Thu, Aug 12, 2021 at 10:23:24AM +0200, Joerg Roedel wrote:
> > > Hi Kirill,
> > >
> > > On Tue, Aug 10, 2021 at 09:26:21AM +0300, Kirill A. Shutemov wrote:
> > > > Accepting happens via a protocol specific to the Virtual Machine
> > > > platform.
> > > That sentence bothers me a bit. Can you explain what is VMM specific in
> > > the acceptance protocol?
> > For TDX we have a single MapGPA hypercall to the VMM plus a TDAcceptPage
> > call to the TDX Module for every accepted page. SEV-SNP has to do
> > something similar.
>
> I think Joerg's question was if TDX has a single ABI for all hypervisors.
> The GHCI specification supports both hypervisor specific and hypervisor
> agnostic calls. But these basic operations like MapGPA are all hypervisor
> agnostic. The only differences would be in the existing hypervisor
> specific PV code.

My point was that TDX and SEV-SNP are going to be different and we need a
way to hook the right protocol for each.
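
One possible shape for such a hook, purely as an illustration (the ops
structure and function names below are assumptions, not what the series
implements):

#include <linux/types.h>
#include <linux/init.h>

/* Platform implementations, assumed to be provided by the TDX and
 * SEV-SNP guest code respectively. */
void tdx_accept_memory(phys_addr_t start, phys_addr_t end);
void snp_accept_memory(phys_addr_t start, phys_addr_t end);

struct memory_accept_ops {
	void (*accept_memory)(phys_addr_t start, phys_addr_t end);
};

static const struct memory_accept_ops tdx_accept_ops = {
	.accept_memory = tdx_accept_memory,
};

static const struct memory_accept_ops snp_accept_ops = {
	.accept_memory = snp_accept_memory,
};

static const struct memory_accept_ops *accept_ops;

/* Selected once during early boot, based on the detected platform. */
void __init accept_memory_init(bool is_tdx)
{
	accept_ops = is_tdx ? &tdx_accept_ops : &snp_accept_ops;
}

/* Generic entry point used by the rest of the kernel. */
void accept_memory(phys_addr_t start, phys_addr_t end)
{
	if (accept_ops && accept_ops->accept_memory)
		accept_ops->accept_memory(start, end);
}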
On Thu, Aug 12, 2021 at 11:22:51PM +0300, Kirill A. Shutemov wrote:
> On Thu, Aug 12, 2021 at 12:33:11PM -0700, Andi Kleen wrote:
> > I think Joerg's question was if TDX has a single ABI for all hypervisors.
> > The GHCI specification supports both hypervisor specific and hypervisor
> > agnostic calls. But these basic operations like MapGPA are all hypervisor
> > agnostic. The only differences would be in the existing hypervisor
> > specific PV code.
>
> My point was that TDX and SEV-SNP are going to be different and we need a
> way to hook the right protocol for each.

Yeah, okay, thanks for the clarification. My misunderstanding was that
there could be a per-hypervisor contract on what memory is pre-accepted
and what Linux is responsible for.

Thanks,

	Joerg