
[v4,0/2] use vmalloc to allocate vmx vcpus

Message ID 20181026075900.111462-1-marcorr@google.com (mailing list archive)

Message

Marc Orr Oct. 26, 2018, 7:58 a.m. UTC
A couple of patches to allocate vmx vcpus with vmalloc() instead of
kmalloc(), which enables vcpu allocation to succeed when contiguous
physical memory is sparse.

Compared to the last version of these patches, this version:
1. Splits out the refactoring of the vmx_msrs struct into its own
patch, as suggested by Sean Christopherson <sean.j.christopherson@intel.com>.
2. Leverages the __vmalloc() API rather than introducing a new vzalloc()
API, as suggested by Michal Hocko <mhocko@kernel.org>.

Marc Orr (2):
  kvm: vmx: refactor vmx_msrs struct for vmalloc
  kvm: vmx: use vmalloc() to allocate vcpus

 arch/x86/kvm/vmx.c  | 92 +++++++++++++++++++++++++++++++++++++++++----
 virt/kvm/kvm_main.c | 28 ++++++++------
 2 files changed, 100 insertions(+), 20 deletions(-)
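Roughly, the allocation change described above looks like this (an illustrative sketch with made-up helper names, not the actual patch; the vmx_msrs refactor in patch 1 handles fields that cannot tolerate being physically scattered):

```c
/* Sketch (not the actual patch): allocating a vcpu_vmx with __vmalloc()
 * instead of a kmem-cache allocation, so the vcpu no longer requires
 * physically contiguous pages.  __GFP_ZERO mirrors the zeroing a
 * kzalloc()-style allocation would provide; the three-argument
 * __vmalloc() signature is the one current as of this series (2018). */
static struct vcpu_vmx *vmx_alloc_vcpu(void)
{
	return __vmalloc(sizeof(struct vcpu_vmx),
			 GFP_KERNEL | __GFP_ZERO, PAGE_KERNEL);
}

static void vmx_free_vcpu_struct(struct vcpu_vmx *vmx)
{
	vfree(vmx);
}
```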

Comments

Matthew Wilcox Oct. 26, 2018, 12:29 p.m. UTC | #1
On Fri, Oct 26, 2018 at 12:58:58AM -0700, Marc Orr wrote:
> A couple of patches to allocate vmx vcpus with vmalloc() instead of
> kmalloc(), which enables vcpu allocation to succeed when contiguous
> physical memory is sparse.

A question that may have been asked already, but if so I didn't see it ...
does kvm_vcpu need to be so damn big?  It's 22kB with the random .config
I happen to have (which gets rounded to 32kB, an order-3 allocation).  If
we can knock 6kB off it (either by shrinking fields or by allocating
pieces separately), it becomes an order-2 allocation.  So, biggest chunks:

        struct kvm_vcpu_arch       arch;                 /*   576 21568 */

        struct kvm_mmu             mmu;                  /*   336   400 */
        struct kvm_mmu             nested_mmu;           /*   736   400 */
        struct fpu                 user_fpu;             /*  2176  4160 */
        struct fpu                 guest_fpu;            /*  6336  4160 */
        struct kvm_cpuid_entry2    cpuid_entries[80];    /* 10580  3200 */
        struct x86_emulate_ctxt    emulate_ctxt;         /* 13792  2656 */
        struct kvm_pmu             pmu;                  /* 17344  1520 */
        struct kvm_vcpu_hv         hyperv;               /* 18872  1872 */
                gfn_t              gfns[64];             /* 20832   512 */

that's about 19kB of the 22kB right there.  Can any of them be shrunk
in size or allocated separately?
Matthew Wilcox Oct. 26, 2018, 2:45 p.m. UTC | #2
On Fri, Oct 26, 2018 at 05:29:48AM -0700, Matthew Wilcox wrote:
> A question that may have been asked already, but if so I didn't see it ...
> does kvm_vcpu need to be so damn big?  It's 22kB with the random .config
> I happen to have (which gets rounded to 32kB, an order-3 allocation).  If
> we can knock 6kB off it (either by shrinking fields or by allocating
> pieces separately), it becomes an order-2 allocation.  So, biggest chunks:
> 
>         struct kvm_vcpu_arch       arch;                 /*   576 21568 */
> 
>         struct kvm_mmu             mmu;                  /*   336   400 */
>         struct kvm_mmu             nested_mmu;           /*   736   400 */
>         struct fpu                 user_fpu;             /*  2176  4160 */
>         struct fpu                 guest_fpu;            /*  6336  4160 */
>         struct kvm_cpuid_entry2    cpuid_entries[80];    /* 10580  3200 */
>         struct x86_emulate_ctxt    emulate_ctxt;         /* 13792  2656 */
>         struct kvm_pmu             pmu;                  /* 17344  1520 */
>         struct kvm_vcpu_hv         hyperv;               /* 18872  1872 */
>                 gfn_t              gfns[64];             /* 20832   512 */
> 
> that's about 19kB of the 22kB right there.  Can any of them be shrunk
> in size or allocated separately?

I just had a fun conversation with Dave Hansen (cc'd) in which he
indicated that the fpu should definitely be dynamically allocated.
See also his commits from 2015 around XSAVE.

0c8c0f03e3a292e031596484275c14cf39c0ab7a
4109ca066b6b899ac7549bf3aac94b178ac95891
Dave Hansen Oct. 26, 2018, 2:49 p.m. UTC | #3
On 10/26/18 7:45 AM, Matthew Wilcox wrote:
>         struct fpu                 user_fpu;             /*  2176  4160 */
>         struct fpu                 guest_fpu;            /*  6336  4160 */

Those are *not* supposed to be embedded in any other structures.  My bad
for not documenting this better.

It also seems really goofy that we need an xsave buffer in the
task_struct for user fpu state, then another in the vcpu.  Isn't one for
user state enough?

In any case, I'd suggest getting rid of 'user_fpu', then either moving
'guest_fpu' to the bottom of the structure, or just make it a 'struct
fpu *' and dynamically allocating it separately.

To do this, I'd take fpu__init_task_struct_size(), and break it apart a
bit to tell you the size of the 'struct fpu' separately from the size of
the 'task struct'.
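A rough sketch of the pointer-based variant Dave describes (hypothetical field and helper names; fpu_struct_size() stands in for whatever a split-up fpu__init_task_struct_size() would export, i.e. the size of a 'struct fpu' including its variably-sized fpregs_state):

```c
/* Sketch of the suggestion above, not the actual patch:
 * 'guest_fpu' becomes a pointer and is allocated separately,
 * removing the embedded xsave buffer from kvm_vcpu_arch. */
struct kvm_vcpu_arch {
	/* ... existing fields ... */
	struct fpu *guest_fpu;		/* was: struct fpu guest_fpu; */
};

static int kvm_arch_alloc_guest_fpu(struct kvm_vcpu_arch *arch)
{
	/* fpu_struct_size() is hypothetical -- see the lead-in. */
	arch->guest_fpu = kzalloc(fpu_struct_size(), GFP_KERNEL);
	if (!arch->guest_fpu)
		return -ENOMEM;
	return 0;
}

static void kvm_arch_free_guest_fpu(struct kvm_vcpu_arch *arch)
{
	kfree(arch->guest_fpu);
	arch->guest_fpu = NULL;
}
```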
Wanpeng Li Oct. 29, 2018, 1:58 a.m. UTC | #4
On Fri, 26 Oct 2018 at 15:59, Marc Orr <marcorr@google.com> wrote:
>
> A couple of patches to allocate vmx vcpus with vmalloc() instead of
> kmalloc(), which enables vcpu allocation to succeed when contiguous
> physical memory is sparse.

We have not yet encountered memory too fragmented to allocate
kvm-related metadata in our overcommit pools; is this a real
requirement from production environments?

Regards,
Wanpeng Li

>
> Compared to the last version of these patches, this version:
> 1. Splits out the refactoring of the vmx_msrs struct into its own
> patch, as suggested by Sean Christopherson <sean.j.christopherson@intel.com>.
> 2. Leverages the __vmalloc() API rather than introducing a new vzalloc()
> API, as suggested by Michal Hocko <mhocko@kernel.org>.
>
> Marc Orr (2):
>   kvm: vmx: refactor vmx_msrs struct for vmalloc
>   kvm: vmx: use vmalloc() to allocate vcpus
>
>  arch/x86/kvm/vmx.c  | 92 +++++++++++++++++++++++++++++++++++++++++----
>  virt/kvm/kvm_main.c | 28 ++++++++------
>  2 files changed, 100 insertions(+), 20 deletions(-)
>
> --
> 2.19.1.568.g152ad8e336-goog
>
Jim Mattson Oct. 29, 2018, 4:25 p.m. UTC | #5
On Sun, Oct 28, 2018 at 6:58 PM, Wanpeng Li <kernellwp@gmail.com> wrote:
> We have not yet encountered memory too fragmented to allocate
> kvm-related metadata in our overcommit pools; is this a real
> requirement from production environments?

Yes.
Matthew Wilcox Oct. 29, 2018, 4:48 p.m. UTC | #6
On Mon, Oct 29, 2018 at 09:25:05AM -0700, Jim Mattson wrote:
> On Sun, Oct 28, 2018 at 6:58 PM, Wanpeng Li <kernellwp@gmail.com> wrote:
> > We have not yet encountered memory too fragmented to allocate
> > kvm-related metadata in our overcommit pools; is this a real
> > requirement from production environments?
> 
> Yes.

Are your logs granular enough to determine if turning this into an
order-2 allocation (by splitting out "struct fpu" allocations) will be
sufficient to resolve your problem, or do we need to turn it into an
order-1 or vmalloc allocation to achieve your production goals?
Jim Mattson Oct. 29, 2018, 6:12 p.m. UTC | #7
On Mon, Oct 29, 2018 at 9:48 AM, Matthew Wilcox <willy@infradead.org> wrote:
> On Mon, Oct 29, 2018 at 09:25:05AM -0700, Jim Mattson wrote:
>> On Sun, Oct 28, 2018 at 6:58 PM, Wanpeng Li <kernellwp@gmail.com> wrote:
>> > We have not yet encountered memory too fragmented to allocate
>> > kvm-related metadata in our overcommit pools; is this a real
>> > requirement from production environments?
>>
>> Yes.
>
> Are your logs granular enough to determine if turning this into an
> order-2 allocation (by splitting out "struct fpu" allocations) will be
> sufficient to resolve your problem, or do we need to turn it into an
> order-1 or vmalloc allocation to achieve your production goals?

Turning this into an order-2 allocation should suffice.
Marc Orr Oct. 29, 2018, 7:16 p.m. UTC | #8
Thanks for all the discussion on this. Give me a bit to investigate
Dave's suggestions around refactoring the fpu state, and I'll report
back with what I find.
Thanks,
Marc
On Mon, Oct 29, 2018 at 11:12 AM Jim Mattson <jmattson@google.com> wrote:
>
> On Mon, Oct 29, 2018 at 9:48 AM, Matthew Wilcox <willy@infradead.org> wrote:
> > On Mon, Oct 29, 2018 at 09:25:05AM -0700, Jim Mattson wrote:
> >> On Sun, Oct 28, 2018 at 6:58 PM, Wanpeng Li <kernellwp@gmail.com> wrote:
> >> > We have not yet encountered memory too fragmented to allocate
> >> > kvm-related metadata in our overcommit pools; is this a real
> >> > requirement from production environments?
> >>
> >> Yes.
> >
> > Are your logs granular enough to determine if turning this into an
> > order-2 allocation (by splitting out "struct fpu" allocations) will be
> > sufficient to resolve your problem, or do we need to turn it into an
> > order-1 or vmalloc allocation to achieve your production goals?
>
> Turning this into an order-2 allocation should suffice.
Marc Orr Oct. 29, 2018, 7:22 p.m. UTC | #9
On Mon, Oct 29, 2018 at 12:16 PM Marc Orr <marcorr@google.com> wrote:
>
> Thanks for all the discussion on this. Give me a bit to investigate
> Dave's suggestions around refactoring the fpu state, and I'll report
> back with what I find.
> Thanks,
> Marc

Also, sorry for top-posting on my last email!
Marc Orr Oct. 31, 2018, 1:06 p.m. UTC | #10
On Fri, Oct 26, 2018 at 7:49 AM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 10/26/18 7:45 AM, Matthew Wilcox wrote:
> >         struct fpu                 user_fpu;             /*  2176  4160 */
> >         struct fpu                 guest_fpu;            /*  6336  4160 */
>
> Those are *not* supposed to be embedded in any other structures.  My bad
> for not documenting this better.
>
> It also seems really goofy that we need an xsave buffer in the
> task_struct for user fpu state, then another in the vcpu.  Isn't one for
> user state enough?
>
> In any case, I'd suggest getting rid of 'user_fpu', then either moving
> 'guest_fpu' to the bottom of the structure, or just make it a 'struct
> fpu *' and dynamically allocating it separately.

I've written a patch to get rid of user_fpu, as suggested here and
will be sending that out shortly.

>
> To do this, I'd take fpu__init_task_struct_size(), and break it apart a
> bit to tell you the size of the 'struct fpu' separately from the size of
> the 'task struct'.

I've written a 2nd patch to make guest_fpu a 'struct fpu *' and
dynamically allocate it separately. The reason I went with this
suggestion, rather than moving 'struct fpu' to the bottom of
kvm_vcpu_arch, is that moving it would still expand kvm_vcpu_arch by
the size of the fpu, which varies according to which fpregs_state is
in use.
Marc Orr Oct. 31, 2018, 1:17 p.m. UTC | #11
On Mon, Oct 29, 2018 at 9:48 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Mon, Oct 29, 2018 at 09:25:05AM -0700, Jim Mattson wrote:
> > On Sun, Oct 28, 2018 at 6:58 PM, Wanpeng Li <kernellwp@gmail.com> wrote:
> > > We have not yet encountered memory too fragmented to allocate
> > > kvm-related metadata in our overcommit pools; is this a real
> > > requirement from production environments?
> >
> > Yes.
>
> Are your logs granular enough to determine if turning this into an
> order-2 allocation (by splitting out "struct fpu" allocations) will be
> sufficient to resolve your problem, or do we need to turn it into an
> order-1 or vmalloc allocation to achieve your production goals?

As noted in my response to Dave Hansen, I've got his suggestions done
and they were successful in drastically reducing the size of the
vcpu_vmx struct, which is great. Specifically, on an upstream kernel,
I've reduced the size of the struct from 23680 down to 15168, which is
order 2.

All that being said, I don't really understand why we wouldn't convert
this memory allocation from a kmalloc() into a vmalloc(). From my
point of view, we are still close to bloating vcpu_vmx into an order-3
allocation, and it's common for vendors to append fields to vcpu_vmx
directly, or more likely to its embedded structs. Though, arguably,
vendors should not be doing that.

Most importantly, it just isn't obvious to me why kmalloc() is
preferred over vmalloc(). From my point of view, vmalloc() does the
exact same thing as kmalloc(), except that it works when contiguous
memory is sparse, which seems better to me.
Matthew Wilcox Oct. 31, 2018, 1:27 p.m. UTC | #12
On Wed, Oct 31, 2018 at 01:17:40PM +0000, Marc Orr wrote:
> All that being said, I don't really understand why we wouldn't convert
> this memory allocation from a kmalloc() into a vmalloc(). From my
> point of view, we are still close to bloating vcpu_vmx into an order 3
> allocation, and it's common for vendors to append to both vcpu_vmx
> directly, or more likely to its embedded structs. Though, arguably,
> vendors should not be doing that.
> 
> Most importantly, it just isn't obvious to me why kmalloc() is
> preferred over vmalloc(). From my point of view, vmalloc() does the
> exact same thing as kmalloc(), except that it works when contiguous
> memory is sparse, which seems better to me.

It's ever-so-slightly faster to access kmalloc memory than vmalloc memory;
kmalloc memory comes from the main kernel mapping, generally mapped with
1GB pages while vmalloc memory is necessarily accessed using 4kB pages,
taking an extra two steps in the page table hierarchy.  There's more
software overhead involved too; in addition to allocating the physical
pages (which both kmalloc and vmalloc have to do), vmalloc has to allocate
a vmap_area and a vm_struct to describe the virtual mapping.

The virtual address space can also get fragmented, just like the physical
address space does, potentially leading to the amusing scenario where
you can allocate physically contiguous memory, but not find a contiguous
chunk of vmalloc space to put it in.

For larger allocations, we tend to prefer kvmalloc() which gives us
the best of both worlds, allocating from kmalloc first and vmalloc as a
fallback, but you've got some fun gyrations to go through to do physical
mapping, so that's not entirely appropriate for your case.
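For completeness, the kvmalloc() pattern described above looks roughly like this (an illustrative fragment, not part of the series):

```c
/* kvzalloc() tries a zeroed kmalloc() first and transparently falls
 * back to vmalloc() when contiguous pages are unavailable; kvfree()
 * releases memory from either allocator. */
static struct vcpu_vmx *vmx_alloc(void)
{
	return kvzalloc(sizeof(struct vcpu_vmx), GFP_KERNEL);
}

static void vmx_free(struct vcpu_vmx *vmx)
{
	kvfree(vmx);
}
```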
Marc Orr Oct. 31, 2018, 1:48 p.m. UTC | #13
On Wed, Oct 31, 2018 at 6:27 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Wed, Oct 31, 2018 at 01:17:40PM +0000, Marc Orr wrote:
> > All that being said, I don't really understand why we wouldn't convert
> > this memory allocation from a kmalloc() into a vmalloc(). From my
> > point of view, we are still close to bloating vcpu_vmx into an order 3
> > allocation, and it's common for vendors to append to both vcpu_vmx
> > directly, or more likely to its embedded structs. Though, arguably,
> > vendors should not be doing that.
> >
> > Most importantly, it just isn't obvious to me why kmalloc() is
> > preferred over vmalloc(). From my point of view, vmalloc() does the
> > exact same thing as kmalloc(), except that it works when contiguous
> > memory is sparse, which seems better to me.
>
> It's ever-so-slightly faster to access kmalloc memory than vmalloc memory;
> kmalloc memory comes from the main kernel mapping, generally mapped with
> 1GB pages while vmalloc memory is necessarily accessed using 4kB pages,
> taking an extra two steps in the page table hierarchy.  There's more
> software overhead involved too; in addition to allocating the physical
> pages (which both kmalloc and vmalloc have to do), vmalloc has to allocate
> a vmap_area and a vm_struct to describe the virtual mapping.
>
> The virtual address space can also get fragmented, just like the physical
> address space does, potentially leading to the amusing scenario where
> you can allocate physically contiguous memory, but not find a contiguous
> chunk of vmalloc space to put it in.
>
> For larger allocations, we tend to prefer kvmalloc() which gives us
> the best of both worlds, allocating from kmalloc first and vmalloc as a
> fallback, but you've got some fun gyrations to go through to do physical
> mapping, so that's not entirely appropriate for your case.

Thanks for the explanation. Is there a way to dynamically detect which
allocator kvmalloc() actually used (i.e., kmalloc() or vmalloc())? If
so, we could use kvmalloc(), and add two code paths to do the physical
mapping, according to whether the underlying memory was allocated with
kmalloc() or vmalloc().
Matthew Wilcox Oct. 31, 2018, 2:21 p.m. UTC | #14
On Wed, Oct 31, 2018 at 01:48:44PM +0000, Marc Orr wrote:
> Thanks for the explanation. Is there a way to dynamically detect which
> allocator kvmalloc() actually used (i.e., kmalloc() or vmalloc())? If
> so, we could use kvmalloc(), and add two code paths to do the physical
> mapping, according to whether the underlying memory was allocated with
> kmalloc() or vmalloc().

Yes -- it's used in the implementation of kvfree():

        if (is_vmalloc_addr(addr))
                vfree(addr);
        else
                kfree(addr);
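Building on that test, the two physical-mapping paths Marc asks about might look roughly like this (an illustrative sketch, not code from the series):

```c
/* Sketch: resolving the page behind a kvmalloc()'d buffer.  vmalloc
 * memory has to be looked up through its page tables, while kmalloc
 * memory sits in the kernel's direct map. */
static struct page *buf_to_page(const void *addr)
{
	if (is_vmalloc_addr(addr))
		return vmalloc_to_page(addr);
	return virt_to_page(addr);
}
```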
Marc Orr Oct. 31, 2018, 9:19 p.m. UTC | #15
On Wed, Oct 31, 2018 at 7:21 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Wed, Oct 31, 2018 at 01:48:44PM +0000, Marc Orr wrote:
> > Thanks for the explanation. Is there a way to dynamically detect which
> > allocator kvmalloc() actually used (i.e., kmalloc() or vmalloc())? If
> > so, we could use kvmalloc(), and add two code paths to do the physical
> > mapping, according to whether the underlying memory was allocated with
> > kmalloc() or vmalloc().
>
> Yes -- it's used in the implementation of kvfree():
>
>         if (is_vmalloc_addr(addr))
>                 vfree(addr);
>         else
>                 kfree(addr);
>

I can drop the vmalloc() patches (unless someone else thinks we should
proceed with a kvmalloc() version). I discussed them with my
colleagues, and the consensus on our end is that we shouldn't let
these structs bloat so much. Thanks for everyone's help in shrinking
them by twice the size of the fpu struct! I'll send out another
version of the patch series, with the two fpu patches and minus the
vmalloc() patches, after I hear back from Dave on a question I just
sent.