Message ID: 20201221162519.GA22504@open-light-1.localdomain (mailing list archive)
Series: speed up page allocation for __GFP_ZERO
On 21.12.20 17:25, Liang Li wrote:
> The first version can be found at: https://lkml.org/lkml/2020/4/12/42
>
> Zeroing out page content usually happens when allocating pages with the
> __GFP_ZERO flag. This is a time-consuming operation and it makes
> populating a large VMA very slow. This patch set introduces a new
> feature that zeroes out free pages before page allocation, which helps
> to speed up page allocation with __GFP_ZERO.
>
> My original intention for adding this feature was to shorten VM
> creation time when an SR-IOV device is attached; it works well and the
> VM creation time is reduced by about 90%.
>
> Creating a VM [64G RAM, 32 CPUs] with GPU passthrough
> =====================================================
> QEMU uses 4K pages, THP is off
>                 round1  round2  round3
> w/o this patch: 23.5s   24.7s   24.6s
> w/  this patch: 10.2s   10.3s   11.2s
>
> QEMU uses 4K pages, THP is on
>                 round1  round2  round3
> w/o this patch: 17.9s   14.8s   14.9s
> w/  this patch:  1.9s    1.8s    1.9s
> =====================================================

I am still not convinced that we want/need this for this (main) use
case. Why can't we use huge pages for such use cases (that really care
about VM creation time) and rather deal with pre-zeroing of huge pages
instead?

If possible, I'd like to avoid GFP_ZERO (for reasons already discussed).

> Obviously, it can do more than this. We can benefit from this feature
> in the following cases:
>
> Interactive scenarios
> =====================
> Shortening application launch time on desktops or mobile phones helps
> to improve the user experience. Tests on a server [Intel(R) Xeon(R) CPU
> E5-2620 v3 @ 2.40GHz] show that zeroing out 1GB of RAM in the kernel
> takes about 200ms, while commonly used applications such as the Firefox
> browser or Office consume 100 ~ 300 MB of RAM right after launch. By
> pre-zeroing free pages, application launch time could be reduced by
> about 20~60ms (can that be visually sensed?). Maybe we can use this
> feature to speed up the launch of Android apps (I didn't run any tests
> for Android).

I am not really sure if you can actually visually sense a difference in
your examples. Startup time of an application is not just memory
allocation (page zeroing) time. It would be interesting to see how much
of a difference this actually makes in practice (e.g., Firefox startup
time etc.).

> Virtualization
> ==============
> Speed up VM creation and shorten guest boot time, especially for the
> PCI SR-IOV device passthrough scenario. Compared with some of the
> paravirtualization solutions, it is easy to deploy because it is
> transparent to the guest and can handle DMA properly in the BIOS stage,
> which the paravirtualization solutions can't handle well.

What is the "para virtualization" approach you are talking about?

> Improve guest performance when VIRTIO_BALLOON_F_REPORTING is used for
> memory overcommit. The VIRTIO_BALLOON_F_REPORTING feature reports guest
> free pages to the VMM, and the VMM unmaps the corresponding host pages
> for reclaim. When the guest allocates a page that was just reclaimed,
> the host has to allocate a new page and zero it out for the guest; in
> this case, pre-zeroed free pages help to speed up the fault-in process
> and reduce the performance impact.

Such faults in the VMM are no different to other faults, when first
accessing a page to be populated. Again, I wonder how much of a
difference it actually makes.

> Speed up kernel routines
> ========================
> This can't be guaranteed because we don't pre-zero all the free pages,
> but it is true for most cases. It can help to speed up some important
> system calls such as fork, which allocates zeroed pages for building
> page tables, and it speeds up page fault handling, especially huge page
> faults. A POC of hugetlb free page pre-zeroing has been done.

Would be interesting to have an actual example with some numbers.

> Security
> ========
> This is a weak version of "introduce init_on_alloc=1 and init_on_free=1
> boot options", which zeroes out pages asynchronously. For users who
> can't tolerate the impact that 'init_on_alloc=1' or 'init_on_free=1'
> brings, this feature provides another choice.

"we don't pre zero out all the free pages" so this is of little actual use.
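[Editorial note: for context on the numbers being debated, the zeroing cost can be estimated from userspace with a small first-touch benchmark like the sketch below; it simply times how long the kernel takes to hand out (and therefore zero) 1 GB of anonymous pages. The size, the 4K stride and the output format are illustrative only, and results will vary by machine.]

/* fault_time.c - time first-touch faults of anonymous memory.
 * Illustrative sketch only; numbers are machine dependent.
 */
#include <stdio.h>
#include <sys/mman.h>
#include <time.h>

int main(void)
{
	size_t sz = 1UL << 30;			/* 1 GB */
	struct timespec t0, t1;

	char *p = mmap(NULL, sz, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	clock_gettime(CLOCK_MONOTONIC, &t0);
	/* Touch one byte per 4K page: each fault makes the kernel
	 * allocate and zero a page before returning it to us. */
	for (size_t off = 0; off < sz; off += 4096)
		p[off] = 1;
	clock_gettime(CLOCK_MONOTONIC, &t1);

	double ms = (t1.tv_sec - t0.tv_sec) * 1e3 +
		    (t1.tv_nsec - t0.tv_nsec) / 1e6;
	printf("faulted %zu MB in %.1f ms\n", sz >> 20, ms);

	munmap(p, sz);
	return 0;
}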
On Tue, Dec 22, 2020 at 4:47 PM David Hildenbrand <david@redhat.com> wrote: > > On 21.12.20 17:25, Liang Li wrote: > > The first version can be found at: https://lkml.org/lkml/2020/4/12/42 > > > > Zero out the page content usually happens when allocating pages with > > the flag of __GFP_ZERO, this is a time consuming operation, it makes > > the population of a large vma area very slowly. This patch introduce > > a new feature for zero out pages before page allocation, it can help > > to speed up page allocation with __GFP_ZERO. > > > > My original intention for adding this feature is to shorten VM > > creation time when SR-IOV devicde is attached, it works good and the > > VM creation time is reduced by about 90%. > > > > Creating a VM [64G RAM, 32 CPUs] with GPU passthrough > > ===================================================== > > QEMU use 4K pages, THP is off > > round1 round2 round3 > > w/o this patch: 23.5s 24.7s 24.6s > > w/ this patch: 10.2s 10.3s 11.2s > > > > QEMU use 4K pages, THP is on > > round1 round2 round3 > > w/o this patch: 17.9s 14.8s 14.9s > > w/ this patch: 1.9s 1.8s 1.9s > > ===================================================== > > > > I am still not convinces that we want/need this for this (main) use > case. Why can't we use huge pages for such use cases (that really care > about VM creation time) and rather deal with pre-zeroing of huge pages > instead? > > If possible, I'd like to avoid GFP_ZERO (for reasons already discussed). > Yes, for VM creation, we can simply use hugetlb for that, just like what I have done in the other series 'mm: support free hugepage pre zero out' I send the v2 because I think VM creation is just one example we can benefit from. > > Obviously, it can do more than this. We can benefit from this feature > > in the flowing case: > > > > Interactive sence > > ================= > > Shorten application lunch time on desktop or mobile phone, it can help > > to improve the user experience. Test shows on a > > server [Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz], zero out 1GB RAM by > > the kernel will take about 200ms, while some mainly used application > > like Firefox browser, Office will consume 100 ~ 300 MB RAM just after > > launch, by pre zero out free pages, it means the application launch > > time will be reduced about 20~60ms (can be visual sensed?). May be > > we can make use of this feature to speed up the launch of Andorid APP > > (I didn't do any test for Android). > > I am not really sure if you can actually visually sense a difference in > your examples. Startup time of an application is not just memory > allocation (page zeroing) time. It would be interesting of much of a > difference this actually makes in practice. (e.g., firefox startup time > etc.) Yes, using Firefox and Office as an example seems not convincing, maybe a large Game APP which consumes several GB of RAM is better. > > > > Virtulization > > ============= > > Speed up VM creation and shorten guest boot time, especially for PCI > > SR-IOV device passthrough scenario. Compared with some of the para > > vitalization solutions, it is easy to deploy because it’s transparent > > to guest and can handle DMA properly in BIOS stage, while the para > > virtualization solution can’t handle it well. > > What is the "para virtualization" approach you are talking about? 
I referred to two topics from KVM Forum 2020; the docs give more details:
https://static.sched.com/hosted_files/kvmforum2020/48/coIOMMU.pdf
https://static.sched.com/hosted_files/kvmforum2020/51/The%20Practice%20Method%20to%20Speed%20Up%2010x%20Boot-up%20Time%20for%20Guest%20in%20Alibaba%20Cloud.pdf

and the following link is mine:
https://static.sched.com/hosted_files/kvmforum2020/90/Speed%20Up%20Creation%20of%20a%20VM%20With%20Passthrough%20GPU.pdf

> >
> > Improve guest performance when VIRTIO_BALLOON_F_REPORTING is used for
> > memory overcommit. The VIRTIO_BALLOON_F_REPORTING feature reports guest
> > free pages to the VMM, and the VMM unmaps the corresponding host pages
> > for reclaim. When the guest allocates a page that was just reclaimed,
> > the host has to allocate a new page and zero it out for the guest; in
> > this case, pre-zeroed free pages help to speed up the fault-in process
> > and reduce the performance impact.
>
> Such faults in the VMM are no different to other faults, when first
> accessing a page to be populated. Again, I wonder how much of a
> difference it actually makes.

I am not just referring to faults in the VMM, I mean the whole process
that handles guest page faults. Without VIRTIO_BALLOON_F_REPORTING,
pages used by the guest are zeroed out only once by the host. With
VIRTIO_BALLOON_F_REPORTING, free pages are reclaimed by the host and may
return to the host buddy free list; when such a page is given back to
the guest, the host kernel needs to zero it out again. This means that
with VIRTIO_BALLOON_F_REPORTING, guest memory performance is degraded by
the frequent zeroing on the host side, and the degradation is especially
obvious for huge pages. Pre-zeroing free pages helps to make guest
memory performance almost the same as without
VIRTIO_BALLOON_F_REPORTING.

> > Speed up kernel routines
> > ========================
> > This can't be guaranteed because we don't pre-zero all the free pages,
> > but it is true for most cases. It can help to speed up some important
> > system calls such as fork, which allocates zeroed pages for building
> > page tables, and it speeds up page fault handling, especially huge page
> > faults. A POC of hugetlb free page pre-zeroing has been done.
>
> Would be interesting to have an actual example with some numbers.

I will try to do some tests to get some numbers.

> > Security
> > ========
> > This is a weak version of "introduce init_on_alloc=1 and init_on_free=1
> > boot options", which zeroes out pages asynchronously. For users who
> > can't tolerate the impact that 'init_on_alloc=1' or 'init_on_free=1'
> > brings, this feature provides another choice.
>
> "we don't pre zero out all the free pages" so this is of little actual use.

OK. It seems none of the reasons listed above is strong enough for this
feature on its own; among all of them, which one is likely to become the
strongest? From the implementation, you will find it is configurable;
users who don't want it can turn it off. Is that not an option?

Thanks for your comments, David.

Liang
> >>> >>> Virtulization >>> ============= >>> Speed up VM creation and shorten guest boot time, especially for PCI >>> SR-IOV device passthrough scenario. Compared with some of the para >>> vitalization solutions, it is easy to deploy because it’s transparent >>> to guest and can handle DMA properly in BIOS stage, while the para >>> virtualization solution can’t handle it well. >> >> What is the "para virtualization" approach you are talking about? > > I refer two topic in the KVM forum 2020, the doc can give more details : > https://static.sched.com/hosted_files/kvmforum2020/48/coIOMMU.pdf > https://static.sched.com/hosted_files/kvmforum2020/51/The%20Practice%20Method%20to%20Speed%20Up%2010x%20Boot-up%20Time%20for%20Guest%20in%20Alibaba%20Cloud.pdf > > and the flowing link is mine: > https://static.sched.com/hosted_files/kvmforum2020/90/Speed%20Up%20Creation%20of%20a%20VM%20With%20Passthrough%20GPU.pdf Thanks for the pointers! I actually did watch your presentation. >> >>> >>> Improve guest performance when use VIRTIO_BALLOON_F_REPORTING for memory >>> overcommit. The VIRTIO_BALLOON_F_REPORTING feature will report guest page >>> to the VMM, VMM will unmap the corresponding host page for reclaim, >>> when guest allocate a page just reclaimed, host will allocate a new page >>> and zero it out for guest, in this case pre zero out free page will help >>> to speed up the proccess of fault in and reduce the performance impaction. >> >> Such faults in the VMM are no different to other faults, when first >> accessing a page to be populated. Again, I wonder how much of a >> difference it actually makes. >> > > I am not just referring to faults in the VMM, I mean the whole process > that handles guest page faults. > without VIRTIO_BALLOON_F_REPORTING, pages used by guests will be zero > out only once by host. With VIRTIO_BALLOON_F_REPORTING, free pages are > reclaimed by the host and may return to the host buddy > free list. When the pages are given back to the guest, the host kernel > needs to zero out it again. It means > with VIRTIO_BALLOON_F_REPORTING, guest memory performance will be > degraded for frequently > zero out operation on host side. The performance degradation will be > obvious for huge page case. Free > page pre zero out can help to make guest memory performance almost the > same as without > VIRTIO_BALLOON_F_REPORTING. Yes, what I am saying is that this fault handling is no different to ordinary faults when accessing a virtual memory location the first time and populating a page. The only difference is that it happens continuously, not only the first time we touch a page. And we might be able to improve handling in the hypervisor in the future. We have been discussing using MADV_FREE instead of MADV_DONTNEED in QEMU for handling free page reporting. Then, guest reported pages will only get reclaimed by the hypervisor when there is actual memory pressure in the hypervisor (e.g., when about to swap). And zeroing a page is an obvious improvement over going to swap. The price for zeroing pages has to be paid at one point. Also note that we've been discussing cache-related things already. If you zero out before giving the page to the guest, the page will already be in the cache - where the guest directly wants to access it. [...] >>> >>> Security >>> ======== >>> This is a weak version of "introduce init_on_alloc=1 and init_on_free=1 >>> boot options", which zero out page in a asynchronous way. 
For users can't >>> tolerate the impaction of 'init_on_alloc=1' or 'init_on_free=1' brings, >>> this feauture provide another choice. >> "we don’t pre zero out all the free pages" so this is of little actual use. > > OK. It seems none of the reasons listed above is strong enough for I was rather saying that for security it's of little use IMHO. Application/VM start up time might be improved by using huge pages (and pre-zeroing these). Free page reporting might be improved by using MADV_FREE instead of MADV_DONTNEED in the hypervisor. > this feature, above all of them, which one is likely to become the > most strong one? From the implementation, you will find it is > configurable, users don't want to use it can turn it off. This is not > an option? Well, we have to maintain the feature and sacrifice a page flag. For example, do we expect someone explicitly enabling the feature just to speed up startup time of an app that consumes a lot of memory? I highly doubt it. I'd love to hear opinions of other people. (a lot of people are offline until beginning of January, including, well, actually me :) )
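[Editorial note: for readers unfamiliar with the distinction discussed above, MADV_DONTNEED discards the reported pages immediately, so the next guest access faults in a freshly zeroed page, while MADV_FREE only marks them lazily reclaimable, so the contents survive unless the host actually comes under memory pressure. Below is a minimal sketch of how a VMM might discard a reported range; hva and len are placeholders for the host mapping of the guest range, and the fallback logic is illustrative. MADV_FREE needs Linux >= 4.5 and only applies to private anonymous mappings.]

#include <sys/mman.h>
#include <errno.h>

/*
 * Discard a guest-reported free range in the VMM.
 * hva/len are placeholders for the host mapping of the reported range.
 */
int discard_reported_range(void *hva, size_t len, int lazy)
{
	/*
	 * MADV_DONTNEED: pages are gone at once; the next guest access
	 * faults in a newly zeroed page (the cost discussed above).
	 *
	 * MADV_FREE: pages are only reclaimed under memory pressure;
	 * if the guest touches them before that, the old contents are
	 * still there and no new zeroing is needed.
	 */
	int advice = lazy ? MADV_FREE : MADV_DONTNEED;

	if (madvise(hva, len, advice) < 0) {
		/* Older kernels may not support MADV_FREE; fall back. */
		if (lazy && errno == EINVAL)
			return madvise(hva, len, MADV_DONTNEED);
		return -1;
	}
	return 0;
}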
On Mon, Dec 21, 2020 at 11:25:22AM -0500, Liang Li wrote: > Creating a VM [64G RAM, 32 CPUs] with GPU passthrough > ===================================================== > QEMU use 4K pages, THP is off > round1 round2 round3 > w/o this patch: 23.5s 24.7s 24.6s > w/ this patch: 10.2s 10.3s 11.2s > > QEMU use 4K pages, THP is on > round1 round2 round3 > w/o this patch: 17.9s 14.8s 14.9s > w/ this patch: 1.9s 1.8s 1.9s > ===================================================== The cost of zeroing pages has to be paid somewhere. You've successfully moved it out of this path that you can measure. So now you've put it somewhere that you're not measuring. Why is this a win? > Speed up kernel routine > ======================= > This can’t be guaranteed because we don’t pre zero out all the free pages, > but is true for most case. It can help to speed up some important system > call just like fork, which will allocate zero pages for building page > table. And speed up the process of page fault, especially for huge page > fault. The POC of Hugetlb free page pre zero out has been done. Try kernbench with and without your patch.
https://static.sched.com/hosted_files/kvmforum2020/51/The%20Practice%20Method%20to%20Speed%20Up%2010x%20Boot-up%20Time%20for%20Guest%20in%20Alibaba%20Cloud.pdf > > > > and the flowing link is mine: > > https://static.sched.com/hosted_files/kvmforum2020/90/Speed%20Up%20Creation%20of%20a%20VM%20With%20Passthrough%20GPU.pdf > > Thanks for the pointers! I actually did watch your presentation. You're welcome! And thanks for your time! :) > >> > >>> > >>> Improve guest performance when use VIRTIO_BALLOON_F_REPORTING for memory > >>> overcommit. The VIRTIO_BALLOON_F_REPORTING feature will report guest page > >>> to the VMM, VMM will unmap the corresponding host page for reclaim, > >>> when guest allocate a page just reclaimed, host will allocate a new page > >>> and zero it out for guest, in this case pre zero out free page will help > >>> to speed up the proccess of fault in and reduce the performance impaction. > >> > >> Such faults in the VMM are no different to other faults, when first > >> accessing a page to be populated. Again, I wonder how much of a > >> difference it actually makes. > >> > > > > I am not just referring to faults in the VMM, I mean the whole process > > that handles guest page faults. > > without VIRTIO_BALLOON_F_REPORTING, pages used by guests will be zero > > out only once by host. With VIRTIO_BALLOON_F_REPORTING, free pages are > > reclaimed by the host and may return to the host buddy > > free list. When the pages are given back to the guest, the host kernel > > needs to zero out it again. It means > > with VIRTIO_BALLOON_F_REPORTING, guest memory performance will be > > degraded for frequently > > zero out operation on host side. The performance degradation will be > > obvious for huge page case. Free > > page pre zero out can help to make guest memory performance almost the > > same as without > > VIRTIO_BALLOON_F_REPORTING. > > Yes, what I am saying is that this fault handling is no different to > ordinary faults when accessing a virtual memory location the first time > and populating a page. The only difference is that it happens > continuously, not only the first time we touch a page. > > And we might be able to improve handling in the hypervisor in the > future. We have been discussing using MADV_FREE instead of MADV_DONTNEED > in QEMU for handling free page reporting. Then, guest reported pages > will only get reclaimed by the hypervisor when there is actual memory > pressure in the hypervisor (e.g., when about to swap). And zeroing a > page is an obvious improvement over going to swap. The price for zeroing > pages has to be paid at one point. > > Also note that we've been discussing cache-related things already. If > you zero out before giving the page to the guest, the page will already > be in the cache - where the guest directly wants to access it. > OK, that's very reasonable and much better. Looking forward for your work. > >>> > >>> Security > >>> ======== > >>> This is a weak version of "introduce init_on_alloc=1 and init_on_free=1 > >>> boot options", which zero out page in a asynchronous way. For users can't > >>> tolerate the impaction of 'init_on_alloc=1' or 'init_on_free=1' brings, > >>> this feauture provide another choice. > >> "we don’t pre zero out all the free pages" so this is of little actual use. > > > > OK. It seems none of the reasons listed above is strong enough for > > I was rather saying that for security it's of little use IMHO. > Application/VM start up time might be improved by using huge pages (and > pre-zeroing these). 
> Free page reporting might be improved by using MADV_FREE instead of
> MADV_DONTNEED in the hypervisor.
>
> > this feature, above all of them, which one is likely to become the
> > strongest one? From the implementation, you will find it is
> > configurable; users who don't want it can turn it off. Is that not
> > an option?
>
> Well, we have to maintain the feature and sacrifice a page flag. For
> example, do we expect someone explicitly enabling the feature just to
> speed up startup time of an app that consumes a lot of memory? I highly
> doubt it.

In our production environment, there are three main applications with
such a requirement. One is QEMU [creating a VM with an SR-IOV
passthrough device]; the other two are DPDK-related applications, DPDK
OVS and SPDK vhost, which populate their memory at startup for best
performance. For SPDK vhost, we make use of the
VHOST_USER_GET/SET_INFLIGHT_FD feature for vhost 'live' upgrade, which
is done by killing the old process and starting a new one with the new
binary. In this case, we want the new process to start as quickly as
possible to shorten the service downtime. We really do enable this
feature to speed up their startup time :)

> I'd love to hear opinions of other people. (a lot of people are offline
> until beginning of January, including, well, actually me :) )

OK. I will wait some time for others' feedback. Happy holidays!

thanks!
Liang
> > ===================================================== > > QEMU use 4K pages, THP is off > > round1 round2 round3 > > w/o this patch: 23.5s 24.7s 24.6s > > w/ this patch: 10.2s 10.3s 11.2s > > > > QEMU use 4K pages, THP is on > > round1 round2 round3 > > w/o this patch: 17.9s 14.8s 14.9s > > w/ this patch: 1.9s 1.8s 1.9s > > ===================================================== > > The cost of zeroing pages has to be paid somewhere. You've successfully > moved it out of this path that you can measure. So now you've put it > somewhere that you're not measuring. Why is this a win? Win or not depends on its effect. For our case, it solves the issue that we faced, so it can be thought as a win for us. If others don't have the issue we faced, the result will be different, maybe they will be affected by the side effect of this feature. I think this is your concern behind the question. right? I will try to do more tests and provide more benchmark performance data. > > Speed up kernel routine > > ======================= > > This can’t be guaranteed because we don’t pre zero out all the free pages, > > but is true for most case. It can help to speed up some important system > > call just like fork, which will allocate zero pages for building page > > table. And speed up the process of page fault, especially for huge page > > fault. The POC of Hugetlb free page pre zero out has been done. > > Try kernbench with and without your patch. OK. Thanks for your suggestion! Liang
Liang Li <liliang.opensource@gmail.com> writes: > The first version can be found at: https://lkml.org/lkml/2020/4/12/42 > > Zero out the page content usually happens when allocating pages with > the flag of __GFP_ZERO, this is a time consuming operation, it makes > the population of a large vma area very slowly. This patch introduce > a new feature for zero out pages before page allocation, it can help > to speed up page allocation with __GFP_ZERO. kzeropaged appears to escape some of the kernel's resource controls, at least if I'm understanding this right. The heavy part of a page fault is moved out of the faulting task's context so the CPU controller can't throttle it. A task that uses these pages can benefit from clearing done by CPUs that it's not allowed to run on. How can it handle these cases?
On Mon, Dec 21, 2020 at 8:25 AM Liang Li <liliang.opensource@gmail.com> wrote: > > The first version can be found at: https://lkml.org/lkml/2020/4/12/42 > > Zero out the page content usually happens when allocating pages with > the flag of __GFP_ZERO, this is a time consuming operation, it makes > the population of a large vma area very slowly. This patch introduce > a new feature for zero out pages before page allocation, it can help > to speed up page allocation with __GFP_ZERO. > > My original intention for adding this feature is to shorten VM > creation time when SR-IOV devicde is attached, it works good and the > VM creation time is reduced by about 90%. > > Creating a VM [64G RAM, 32 CPUs] with GPU passthrough > ===================================================== > QEMU use 4K pages, THP is off > round1 round2 round3 > w/o this patch: 23.5s 24.7s 24.6s > w/ this patch: 10.2s 10.3s 11.2s > > QEMU use 4K pages, THP is on > round1 round2 round3 > w/o this patch: 17.9s 14.8s 14.9s > w/ this patch: 1.9s 1.8s 1.9s > ===================================================== > > Obviously, it can do more than this. We can benefit from this feature > in the flowing case: So I am not sure page reporting is the best thing to base this page zeroing setup on. The idea with page reporting is to essentially act as a leaky bucket and allow the guest to drop memory it isn't using slowly so if it needs to reinflate it won't clash with the applications that need memory. What you are doing here seems far more aggressive in that you are going down to low order pages and sleeping instead of rescheduling for the next time interval. Also I am not sure your SR-IOV creation time test is a good justification for this extra overhead. With your patches applied all you are doing is making use of the free time before the test to do the page zeroing instead of doing it during your test. As such your CPU overhead prior to running the test would be higher and you haven't captured that information. One thing I would be interested in seeing is what is the load this is adding when you are running simple memory allocation/free type tests on the system. For example it might be useful to see what the will-it-scale page_fault1 tests look like with this patch applied versus not applied. I suspect it would be adding some amount of overhead as you have to spend a ton of time scanning all the pages and that will be considerable overhead.
[...] >> I was rather saying that for security it's of little use IMHO. >> Application/VM start up time might be improved by using huge pages (and >> pre-zeroing these). Free page reporting might be improved by using >> MADV_FREE instead of MADV_DONTNEED in the hypervisor. >> >>> this feature, above all of them, which one is likely to become the >>> most strong one? From the implementation, you will find it is >>> configurable, users don't want to use it can turn it off. This is not >>> an option? >> >> Well, we have to maintain the feature and sacrifice a page flag. For >> example, do we expect someone explicitly enabling the feature just to >> speed up startup time of an app that consumes a lot of memory? I highly >> doubt it. > > In our production environment, there are three main applications have such > requirement, one is QEMU [creating a VM with SR-IOV passthrough device], > anther other two are DPDK related applications, DPDK OVS and SPDK vhost, > for best performance, they populate memory when starting up. For SPDK vhost, > we make use of the VHOST_USER_GET/SET_INFLIGHT_FD feature for > vhost 'live' upgrade, which is done by killing the old process and > starting a new > one with the new binary. In this case, we want the new process started as quick > as possible to shorten the service downtime. We really enable this feature > to speed up startup time for them :) Thanks for info on the use case! All of these use cases either already use, or could use, huge pages IMHO. It's not your ordinary proprietary gaming app :) This is where pre-zeroing of huge pages could already help. Just wondering, wouldn't it be possible to use tmpfs/hugetlbfs ... creating a file and pre-zeroing it from another process, or am I missing something important? At least for QEMU this should work AFAIK, where you can just pass the file to be use using memory-backend-file. > >> I'd love to hear opinions of other people. (a lot of people are offline >> until beginning of January, including, well, actually me :) ) > > OK. I will wait some time for others' feedback. Happy holidays! To you too, cheers!
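[Editorial note: a sketch of the userspace-only approach suggested above -- a helper process creates and pre-faults a hugetlbfs (or tmpfs) file ahead of time, and QEMU (or another consumer) later maps the same file as backing memory, e.g. via memory-backend-file. The mount point, file name and size below are invented for the example; it assumes enough huge pages are reserved in the pool, and error handling is minimal.]

/* prepopulate.c - pre-fault a hugetlbfs file so a later consumer
 * (e.g. a VM using that file as RAM backing) does not pay the
 * allocation/zeroing cost at startup. Paths and sizes are illustrative.
 */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	const char *path = "/dev/hugepages/vm-ram-0";	/* assumed mount */
	size_t sz = 8UL << 30;				/* 8 GB guest RAM */

	int fd = open(path, O_CREAT | O_RDWR, 0600);
	if (fd < 0 || ftruncate(fd, sz) < 0) {
		perror("open/ftruncate");
		return 1;
	}

	/* MAP_POPULATE pre-faults the whole range now, in this helper,
	 * so the pages are already allocated (and zeroed by the kernel)
	 * when the real consumer maps the same file later. */
	void *p = mmap(NULL, sz, PROT_READ | PROT_WRITE,
		       MAP_SHARED | MAP_POPULATE, fd, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	munmap(p, sz);	/* pages stay attached to the file */
	close(fd);
	return 0;
}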
On Wed, Dec 23, 2020 at 4:41 PM David Hildenbrand <david@redhat.com> wrote: > > [...] > > >> I was rather saying that for security it's of little use IMHO. > >> Application/VM start up time might be improved by using huge pages (and > >> pre-zeroing these). Free page reporting might be improved by using > >> MADV_FREE instead of MADV_DONTNEED in the hypervisor. > >> > >>> this feature, above all of them, which one is likely to become the > >>> most strong one? From the implementation, you will find it is > >>> configurable, users don't want to use it can turn it off. This is not > >>> an option? > >> > >> Well, we have to maintain the feature and sacrifice a page flag. For > >> example, do we expect someone explicitly enabling the feature just to > >> speed up startup time of an app that consumes a lot of memory? I highly > >> doubt it. > > > > In our production environment, there are three main applications have such > > requirement, one is QEMU [creating a VM with SR-IOV passthrough device], > > anther other two are DPDK related applications, DPDK OVS and SPDK vhost, > > for best performance, they populate memory when starting up. For SPDK vhost, > > we make use of the VHOST_USER_GET/SET_INFLIGHT_FD feature for > > vhost 'live' upgrade, which is done by killing the old process and > > starting a new > > one with the new binary. In this case, we want the new process started as quick > > as possible to shorten the service downtime. We really enable this feature > > to speed up startup time for them :) > > Thanks for info on the use case! > > All of these use cases either already use, or could use, huge pages > IMHO. It's not your ordinary proprietary gaming app :) This is where > pre-zeroing of huge pages could already help. You are welcome. For some historical reason, some of our services are not using hugetlbfs, that is why I didn't start with hugetlbfs. > Just wondering, wouldn't it be possible to use tmpfs/hugetlbfs ... > creating a file and pre-zeroing it from another process, or am I missing > something important? At least for QEMU this should work AFAIK, where you > can just pass the file to be use using memory-backend-file. > If using another process to create a file, we can offload the overhead to another process, and there is no need to pre-zeroing it's content, just populating the memory is enough. If we do it that way, then how to determine the size of the file? it depends on the RAM size of the VM the customer buys. Maybe we can create a file large enough in advance and truncate it to the right size just before the VM is created. Then, how many large files should be created on a host? You will find there are a lot of things that have to be handled properly. I think it's possible to make it work well, but we will transfer the management complexity to up layer components. It's a bad practice to let upper layer components process such low level details which should be handled in the OS layer. > > > >> I'd love to hear opinions of other people. (a lot of people are offline > >> until beginning of January, including, well, actually me :) ) > > > > OK. I will wait some time for others' feedback. Happy holidays! > > To you too, cheers! > I have to work at least two months before the vacation. :( Liang
On Tue 22-12-20 22:42:13, Liang Li wrote:
> > > =====================================================
> > > QEMU uses 4K pages, THP is off
> > >                 round1  round2  round3
> > > w/o this patch: 23.5s   24.7s   24.6s
> > > w/  this patch: 10.2s   10.3s   11.2s
> > >
> > > QEMU uses 4K pages, THP is on
> > >                 round1  round2  round3
> > > w/o this patch: 17.9s   14.8s   14.9s
> > > w/  this patch:  1.9s    1.8s    1.9s
> > > =====================================================
> >
> > The cost of zeroing pages has to be paid somewhere. You've successfully
> > moved it out of this path that you can measure. So now you've put it
> > somewhere that you're not measuring. Why is this a win?
>
> Whether it is a win depends on the effect. For our case, it solves the
> issue we faced, so it is a win for us. For others who don't have that
> issue, the result will be different; maybe they will only see the side
> effects of this feature. I think that is the concern behind your
> question, right? I will try to run more tests and provide more
> benchmark data.

Yes, zeroing memory does have a noticeable overhead, but we cannot
simply allow tasks to spill this overhead over to all other users by
default. So if anything, this would need to be an opt-in feature
configurable by the administrator.
On Mon 21-12-20 11:25:22, Liang Li wrote:
[...]
> Security
> ========
> This is a weak version of "introduce init_on_alloc=1 and init_on_free=1
> boot options", which zeroes out pages asynchronously. For users who
> can't tolerate the impact that 'init_on_alloc=1' or 'init_on_free=1'
> brings, this feature provides another choice.

Most of the use cases are about start-up time improvements, IIUC. Have
you tried init_on_free, or would it be prohibitive for your workloads?
> > Whether it is a win depends on the effect. For our case, it solves the
> > issue we faced, so it is a win for us. For others who don't have that
> > issue, the result will be different; maybe they will only see the side
> > effects of this feature. I think that is the concern behind your
> > question, right? I will try to run more tests and provide more
> > benchmark data.
>
> Yes, zeroing memory does have a noticeable overhead, but we cannot
> simply allow tasks to spill this overhead over to all other users by
> default. So if anything, this would need to be an opt-in feature
> configurable by the administrator.
> --
> Michal Hocko
> SUSE Labs

I am aware of the overhead, which is why I added a switch in /sys to
enable or disable the feature dynamically.

Thanks
Liang
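[Editorial note: for illustration only, a runtime switch like the one mentioned above could be exposed as a sysfs attribute along the lines of the sketch below; the attribute name and location are hypothetical and do not necessarily match the actual patch set.]

/* Hypothetical sketch of a /sys knob gating the pre-zeroing worker;
 * the name "prezero_enabled" and its location are made up here.
 */
#include <linux/kernel.h>
#include <linux/kobject.h>
#include <linux/module.h>
#include <linux/sysfs.h>

static bool prezero_enabled;

static ssize_t enabled_show(struct kobject *kobj,
			    struct kobj_attribute *attr, char *buf)
{
	return sysfs_emit(buf, "%d\n", prezero_enabled);
}

static ssize_t enabled_store(struct kobject *kobj,
			     struct kobj_attribute *attr,
			     const char *buf, size_t count)
{
	bool val;

	if (kstrtobool(buf, &val))
		return -EINVAL;
	/* The worker would check prezero_enabled before scanning the
	 * free lists; disabling it only stops future pre-zeroing. */
	prezero_enabled = val;
	return count;
}

static struct kobj_attribute prezero_attr =
	__ATTR(prezero_enabled, 0644, enabled_show, enabled_store);

static int __init prezero_sysfs_init(void)
{
	/* Exposes /sys/kernel/prezero_enabled in this sketch. */
	return sysfs_create_file(kernel_kobj, &prezero_attr.attr);
}
module_init(prezero_sysfs_init);
MODULE_LICENSE("GPL");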
On Mon, Jan 4, 2021 at 8:56 PM Michal Hocko <mhocko@suse.com> wrote: > > On Mon 21-12-20 11:25:22, Liang Li wrote: > [...] > > Security > > ======== > > This is a weak version of "introduce init_on_alloc=1 and init_on_free=1 > > boot options", which zero out page in a asynchronous way. For users can't > > tolerate the impaction of 'init_on_alloc=1' or 'init_on_free=1' brings, > > this feauture provide another choice. > > Most of the usecases are about the start up time imporvemtns IIUC. Have > you tried to use init_on_free or this would be prohibitive for your > workloads? > I have not tried yet. 'init_on_free' may help to shorten the start up time. In our use case, we care about both the VM creation time and the VM reboot time[terminate QEMU process first and launch a new one], 'init_on_free' will slow down the termination process and is not helpful for VM reboot. Our aim is to speed up 'VM start up' and not slow down 'VM shut down'. Thanks Liang
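[Editorial note: to make the trade-off above concrete, init_on_alloc pays the zeroing cost on the allocation path, init_on_free pays it on the free path (hence the slower teardown Liang is worried about), and the proposed feature moves it into a background worker. The stand-in below shows where the checks conceptually sit; the types, flag value and bodies are simplified illustrations, not the actual kernel code.]

/* Illustrative stand-in, not kernel code: shows on which path each
 * option pays the zeroing cost that is being discussed.
 */
#include <stdbool.h>
#include <string.h>

#define PAGE_SIZE	4096u
#define GFP_ZERO	0x1u		/* placeholder for __GFP_ZERO */

static bool init_on_alloc;		/* "init_on_alloc=1" boot option */
static bool init_on_free;		/* "init_on_free=1" boot option */

struct page { unsigned char data[PAGE_SIZE]; };

/* Allocation side: init_on_alloc (or a __GFP_ZERO request) zeroes
 * here, i.e. on every allocation / first fault -- VM start-up. */
void post_alloc_hook(struct page *page, unsigned int gfp)
{
	if (init_on_alloc || (gfp & GFP_ZERO))
		memset(page->data, 0, PAGE_SIZE);
}

/* Free side: init_on_free zeroes here, i.e. on munmap/process exit --
 * the VM shutdown path that should stay fast. */
void free_pages_prepare(struct page *page)
{
	if (init_on_free)
		memset(page->data, 0, PAGE_SIZE);
}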
> Am 23.12.2020 um 13:12 schrieb Liang Li <liliang324@gmail.com>: > > On Wed, Dec 23, 2020 at 4:41 PM David Hildenbrand <david@redhat.com> wrote: >> >> [...] >> >>>> I was rather saying that for security it's of little use IMHO. >>>> Application/VM start up time might be improved by using huge pages (and >>>> pre-zeroing these). Free page reporting might be improved by using >>>> MADV_FREE instead of MADV_DONTNEED in the hypervisor. >>>> >>>>> this feature, above all of them, which one is likely to become the >>>>> most strong one? From the implementation, you will find it is >>>>> configurable, users don't want to use it can turn it off. This is not >>>>> an option? >>>> >>>> Well, we have to maintain the feature and sacrifice a page flag. For >>>> example, do we expect someone explicitly enabling the feature just to >>>> speed up startup time of an app that consumes a lot of memory? I highly >>>> doubt it. >>> >>> In our production environment, there are three main applications have such >>> requirement, one is QEMU [creating a VM with SR-IOV passthrough device], >>> anther other two are DPDK related applications, DPDK OVS and SPDK vhost, >>> for best performance, they populate memory when starting up. For SPDK vhost, >>> we make use of the VHOST_USER_GET/SET_INFLIGHT_FD feature for >>> vhost 'live' upgrade, which is done by killing the old process and >>> starting a new >>> one with the new binary. In this case, we want the new process started as quick >>> as possible to shorten the service downtime. We really enable this feature >>> to speed up startup time for them :) Am I wrong or does using hugeltbfs/tmpfs ... i.e., a file not-deleted between shutting down the old instances and firing up the new instance just solve this issue? >> >> Thanks for info on the use case! >> >> All of these use cases either already use, or could use, huge pages >> IMHO. It's not your ordinary proprietary gaming app :) This is where >> pre-zeroing of huge pages could already help. > > You are welcome. For some historical reason, some of our services are > not using hugetlbfs, that is why I didn't start with hugetlbfs. > >> Just wondering, wouldn't it be possible to use tmpfs/hugetlbfs ... >> creating a file and pre-zeroing it from another process, or am I missing >> something important? At least for QEMU this should work AFAIK, where you >> can just pass the file to be use using memory-backend-file. >> > If using another process to create a file, we can offload the overhead to > another process, and there is no need to pre-zeroing it's content, just > populating the memory is enough. Right, if non-zero memory can be tolerated (e.g., for vms usually has to). > If we do it that way, then how to determine the size of the file? it depends > on the RAM size of the VM the customer buys. > Maybe we can create a file > large enough in advance and truncate it to the right size just before the > VM is created. Then, how many large files should be created on a host? That‘s mostly already existing scheduling logic, no? (How many vms can I put onto a specific machine eventually) > You will find there are a lot of things that have to be handled properly. > I think it's possible to make it work well, but we will transfer the > management complexity to up layer components. It's a bad practice to let > upper layer components process such low level details which should be > handled in the OS layer. It‘s bad practice to squeeze things into the kernel that can just be handled on upper layers ;)
> >>> In our production environment, there are three main applications have such > >>> requirement, one is QEMU [creating a VM with SR-IOV passthrough device], > >>> anther other two are DPDK related applications, DPDK OVS and SPDK vhost, > >>> for best performance, they populate memory when starting up. For SPDK vhost, > >>> we make use of the VHOST_USER_GET/SET_INFLIGHT_FD feature for > >>> vhost 'live' upgrade, which is done by killing the old process and > >>> starting a new > >>> one with the new binary. In this case, we want the new process started as quick > >>> as possible to shorten the service downtime. We really enable this feature > >>> to speed up startup time for them :) > > Am I wrong or does using hugeltbfs/tmpfs ... i.e., a file not-deleted between shutting down the old instances and firing up the new instance just solve this issue? You are right, it works for the SPDK vhost upgrade case. > > >> > >> Thanks for info on the use case! > >> > >> All of these use cases either already use, or could use, huge pages > >> IMHO. It's not your ordinary proprietary gaming app :) This is where > >> pre-zeroing of huge pages could already help. > > > > You are welcome. For some historical reason, some of our services are > > not using hugetlbfs, that is why I didn't start with hugetlbfs. > > > >> Just wondering, wouldn't it be possible to use tmpfs/hugetlbfs ... > >> creating a file and pre-zeroing it from another process, or am I missing > >> something important? At least for QEMU this should work AFAIK, where you > >> can just pass the file to be use using memory-backend-file. > >> > > If using another process to create a file, we can offload the overhead to > > another process, and there is no need to pre-zeroing it's content, just > > populating the memory is enough. > > Right, if non-zero memory can be tolerated (e.g., for vms usually has to). I mean there is no need to pre-zeroing the file content obviously in user space, the kernel will do it when populating the memory. > > If we do it that way, then how to determine the size of the file? it depends > > on the RAM size of the VM the customer buys. > > Maybe we can create a file > > large enough in advance and truncate it to the right size just before the > > VM is created. Then, how many large files should be created on a host? > > That‘s mostly already existing scheduling logic, no? (How many vms can I put onto a specific machine eventually) It depends on how the scheduling component is designed. Yes, you can put 10 VMs with 4C8G(4CPU, 8G RAM) on a host and 20 VMs with 2C4G on another one. But if one type of them, e.g. 4C8G are sold out, customers can't by more 4C8G VM while there are some free 2C4G VMs, the resource reserved for them can be provided as 4C8G VMs > > You will find there are a lot of things that have to be handled properly. > > I think it's possible to make it work well, but we will transfer the > > management complexity to up layer components. It's a bad practice to let > > upper layer components process such low level details which should be > > handled in the OS layer. > > It‘s bad practice to squeeze things into the kernel that can just be handled on upper layers ;) > You must know there are a lot of functions in the kernel which can be done in userspace. e.g. Some of the device emulations like APIC, vhost-net backend which has userspace implementation. :) Bad or not depends on the benefits the solution brings. 
From the viewpoint of a user space application, the kernel should
provide a high-performance memory management service. That's why I
think it should be done in the kernel.

Thanks
Liang
On 05.01.21 03:14, Liang Li wrote: >>>>> In our production environment, there are three main applications have such >>>>> requirement, one is QEMU [creating a VM with SR-IOV passthrough device], >>>>> anther other two are DPDK related applications, DPDK OVS and SPDK vhost, >>>>> for best performance, they populate memory when starting up. For SPDK vhost, >>>>> we make use of the VHOST_USER_GET/SET_INFLIGHT_FD feature for >>>>> vhost 'live' upgrade, which is done by killing the old process and >>>>> starting a new >>>>> one with the new binary. In this case, we want the new process started as quick >>>>> as possible to shorten the service downtime. We really enable this feature >>>>> to speed up startup time for them :) >> >> Am I wrong or does using hugeltbfs/tmpfs ... i.e., a file not-deleted between shutting down the old instances and firing up the new instance just solve this issue? > > You are right, it works for the SPDK vhost upgrade case. > >> >>>> >>>> Thanks for info on the use case! >>>> >>>> All of these use cases either already use, or could use, huge pages >>>> IMHO. It's not your ordinary proprietary gaming app :) This is where >>>> pre-zeroing of huge pages could already help. >>> >>> You are welcome. For some historical reason, some of our services are >>> not using hugetlbfs, that is why I didn't start with hugetlbfs. >>> >>>> Just wondering, wouldn't it be possible to use tmpfs/hugetlbfs ... >>>> creating a file and pre-zeroing it from another process, or am I missing >>>> something important? At least for QEMU this should work AFAIK, where you >>>> can just pass the file to be use using memory-backend-file. >>>> >>> If using another process to create a file, we can offload the overhead to >>> another process, and there is no need to pre-zeroing it's content, just >>> populating the memory is enough. >> >> Right, if non-zero memory can be tolerated (e.g., for vms usually has to). > > I mean there is no need to pre-zeroing the file content obviously in user space, > the kernel will do it when populating the memory. > >>> If we do it that way, then how to determine the size of the file? it depends >>> on the RAM size of the VM the customer buys. >>> Maybe we can create a file >>> large enough in advance and truncate it to the right size just before the >>> VM is created. Then, how many large files should be created on a host? >> >> That‘s mostly already existing scheduling logic, no? (How many vms can I put onto a specific machine eventually) > > It depends on how the scheduling component is designed. Yes, you can put > 10 VMs with 4C8G(4CPU, 8G RAM) on a host and 20 VMs with 2C4G on > another one. But if one type of them, e.g. 4C8G are sold out, customers > can't by more 4C8G VM while there are some free 2C4G VMs, the resource > reserved for them can be provided as 4C8G VMs > 1. You can, just the startup time will be a little slower? E.g., grow pre-allocated 4G file to 8G. 2. Or let's be creative: teach QEMU to construct a single RAMBlock/MemoryRegion out of multiple tmpfs files. Works as long as you don't go crazy on different VM sizes / size differences. 3. In your example above, you can dynamically rebalance as VMs are getting sold, to make sure you always have "big ones" lying around you can shrink on demand. > > You must know there are a lot of functions in the kernel which can > be done in userspace. e.g. Some of the device emulations like APIC, > vhost-net backend which has userspace implementation. :) > Bad or not depends on the benefits the solution brings. 
> From the viewpoint of a user space application, the kernel should > provide high performance memory management service. That's why > I think it should be done in the kernel. As I expressed a couple of times already, I don't see why using hugetlbfs and implementing some sort of pre-zeroing there isn't sufficient. We really don't *want* complicated things deep down in the mm core if there are reasonable alternatives.
> >> That's mostly already existing scheduling logic, no? (How many VMs
> >> can I put onto a specific machine eventually)
> >
> > It depends on how the scheduling component is designed. Yes, you can put
> > 10 VMs with 4C8G (4 CPUs, 8G RAM) on a host and 20 VMs with 2C4G on
> > another one. But if one type of them, e.g. 4C8G, is sold out, customers
> > can't buy more 4C8G VMs while there are some free 2C4G VMs, although the
> > resources reserved for them could be provided as 4C8G VMs.
>
> 1. You can, just the startup time will be a little slower? E.g., grow
> pre-allocated 4G file to 8G.
>
> 2. Or let's be creative: teach QEMU to construct a single
> RAMBlock/MemoryRegion out of multiple tmpfs files. Works as long as you
> don't go crazy on different VM sizes / size differences.
>
> 3. In your example above, you can dynamically rebalance as VMs are
> getting sold, to make sure you always have "big ones" lying around you
> can shrink on demand.

Yes, we can always come up with some ways to make things work. It will
just drive the developers of the upper-layer components crazy :)

> > You must know there are a lot of functions in the kernel which can
> > be done in userspace, e.g. some of the device emulations like the APIC,
> > or the vhost-net backend which has a userspace implementation. :)
> > Bad or not depends on the benefits the solution brings.
> > From the viewpoint of a user space application, the kernel should
> > provide a high-performance memory management service. That's why
> > I think it should be done in the kernel.
>
> As I expressed a couple of times already, I don't see why using
> hugetlbfs and implementing some sort of pre-zeroing there isn't sufficient.

Did I miss something before? I thought you doubted the need for
hugetlbfs free page pre-zeroing. Hugetlbfs is a good choice and is
sufficient.

> We really don't *want* complicated things deep down in the mm core if
> there are reasonable alternatives.

I understand your concern: we should have a sufficient reason to add a
new feature to the kernel, and for this one, its main value is to make
applications' lives easier. Implementing it in hugetlbfs can avoid
adding more complexity to the core MM. I will send out a new revision
and drop the 'buddy free pages pre zero out' part.

Thanks for your suggestion!
Liang
On 05.01.21 11:22, Liang Li wrote: >>>> That‘s mostly already existing scheduling logic, no? (How many vms can I put onto a specific machine eventually) >>> >>> It depends on how the scheduling component is designed. Yes, you can put >>> 10 VMs with 4C8G(4CPU, 8G RAM) on a host and 20 VMs with 2C4G on >>> another one. But if one type of them, e.g. 4C8G are sold out, customers >>> can't by more 4C8G VM while there are some free 2C4G VMs, the resource >>> reserved for them can be provided as 4C8G VMs >>> >> >> 1. You can, just the startup time will be a little slower? E.g., grow >> pre-allocated 4G file to 8G. >> >> 2. Or let's be creative: teach QEMU to construct a single >> RAMBlock/MemoryRegion out of multiple tmpfs files. Works as long as you >> don't go crazy on different VM sizes / size differences. >> >> 3. In your example above, you can dynamically rebalance as VMs are >> getting sold, to make sure you always have "big ones" lying around you >> can shrink on demand. >> > Yes, we can always come up with some ways to make things work. > it will make the developer of the upper layer component crazy :) I'd say that's life in upper layers to optimize special (!) use cases. :) >>> >>> You must know there are a lot of functions in the kernel which can >>> be done in userspace. e.g. Some of the device emulations like APIC, >>> vhost-net backend which has userspace implementation. :) >>> Bad or not depends on the benefits the solution brings. >>> From the viewpoint of a user space application, the kernel should >>> provide high performance memory management service. That's why >>> I think it should be done in the kernel. >> >> As I expressed a couple of times already, I don't see why using >> hugetlbfs and implementing some sort of pre-zeroing there isn't sufficient. > > Did I miss something before? I thought you doubt the need for > hugetlbfs free page pre zero out. Hugetlbfs is a good choice and is > sufficient. I remember even suggesting to focus on hugetlbfs during your KVM talk when chatting. Maybe I was not clear before. > >> We really don't *want* complicated things deep down in the mm core if >> there are reasonable alternatives. >> > I understand your concern, we should have sufficient reason to add a new > feature to the kernel. And for this one, it's most value is to make the > application's life is easier. And implementing it in hugetlbfs can avoid > adding more complexity to core MM. Exactly, that's my point. Some people might still disagree with the hugetlbfs approach, but there it's easier to add tunables without affecting the overall system.
The first version can be found at: https://lkml.org/lkml/2020/4/12/42

Zeroing out page content usually happens when allocating pages with the
__GFP_ZERO flag. This is a time-consuming operation and it makes
populating a large VMA very slow. This patch set introduces a new
feature that zeroes out free pages before page allocation, which helps
to speed up page allocation with __GFP_ZERO.

My original intention for adding this feature was to shorten VM creation
time when an SR-IOV device is attached; it works well and the VM
creation time is reduced by about 90%.

Creating a VM [64G RAM, 32 CPUs] with GPU passthrough
=====================================================
QEMU uses 4K pages, THP is off
                round1  round2  round3
w/o this patch: 23.5s   24.7s   24.6s
w/  this patch: 10.2s   10.3s   11.2s

QEMU uses 4K pages, THP is on
                round1  round2  round3
w/o this patch: 17.9s   14.8s   14.9s
w/  this patch:  1.9s    1.8s    1.9s
=====================================================

Obviously, it can do more than this. We can benefit from this feature
in the following cases:

Interactive scenarios
=====================
Shortening application launch time on desktops or mobile phones helps
to improve the user experience. Tests on a server [Intel(R) Xeon(R) CPU
E5-2620 v3 @ 2.40GHz] show that zeroing out 1GB of RAM in the kernel
takes about 200ms, while commonly used applications such as the Firefox
browser or Office consume 100 ~ 300 MB of RAM right after launch. By
pre-zeroing free pages, application launch time could be reduced by
about 20~60ms (can that be visually sensed?). Maybe we can use this
feature to speed up the launch of Android apps (I didn't run any tests
for Android).

Virtualization
==============
Speed up VM creation and shorten guest boot time, especially for the
PCI SR-IOV device passthrough scenario. Compared with some of the
paravirtualization solutions, it is easy to deploy because it is
transparent to the guest and can handle DMA properly in the BIOS stage,
which the paravirtualization solutions can't handle well.

Improve guest performance when VIRTIO_BALLOON_F_REPORTING is used for
memory overcommit. The VIRTIO_BALLOON_F_REPORTING feature reports guest
free pages to the VMM, and the VMM unmaps the corresponding host pages
for reclaim. When the guest allocates a page that was just reclaimed,
the host has to allocate a new page and zero it out for the guest; in
this case, pre-zeroed free pages help to speed up the fault-in process
and reduce the performance impact.

Speed up kernel routines
========================
This can't be guaranteed because we don't pre-zero all the free pages,
but it is true for most cases. It can help to speed up some important
system calls such as fork, which allocates zeroed pages for building
page tables, and it speeds up page fault handling, especially huge page
faults. A POC of hugetlb free page pre-zeroing has been done.

Security
========
This is a weak version of "introduce init_on_alloc=1 and init_on_free=1
boot options", which zeroes out pages asynchronously. For users who
can't tolerate the impact that 'init_on_alloc=1' or 'init_on_free=1'
brings, this feature provides another choice.

In the feedback on the first version, cache pollution was the main
concern of the mm folks. On the other hand, this feature is really
helpful for some use cases, so maybe we should let the user decide
whether to use it. A switch has been added under /sys: users who don't
like the feature can turn the switch off, or configure a large batch
size to reduce cache pollution.
To make the whole feature work, support for pre-zeroing free huge pages
should be added to hugetlbfs; I will send another patch for that.

Liang Li (4):
  mm: let user decide page reporting option
  mm: pre zero out free pages to speed up page allocation for __GFP_ZERO
  mm: make page reporing worker works better for low order page
  mm: Add batch size for free page reporting

 drivers/virtio/virtio_balloon.c |   3 +
 include/linux/highmem.h         |  31 +++-
 include/linux/page-flags.h      |  16 +-
 include/linux/page_reporting.h  |   3 +
 include/trace/events/mmflags.h  |   7 +
 mm/Kconfig                      |  10 ++
 mm/Makefile                     |   1 +
 mm/huge_memory.c                |   3 +-
 mm/page_alloc.c                 |   4 +
 mm/page_prezero.c               | 266 ++++++++++++++++++++++++++++++++
 mm/page_prezero.h               |  13 ++
 mm/page_reporting.c             |  49 +++++-
 mm/page_reporting.h             |  16 +-
 13 files changed, 405 insertions(+), 17 deletions(-)
 create mode 100644 mm/page_prezero.c
 create mode 100644 mm/page_prezero.h

Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Liang Li <liliang324@gmail.com>
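[Editorial note: as orientation, the core idea of the series can be summarized by the simplified, self-contained sketch below -- a background worker zeroes pages sitting on the free lists and tags them, and the allocation path skips the clear when a tagged page is handed out for a __GFP_ZERO request. The flag and helper names are invented for the illustration and do not match the patches exactly, which build on the page reporting infrastructure and a new page flag.]

/* Conceptual toy of the idea -- not the real patches.
 * Names (PG_PREZEROED, prezero_worker, ...) are invented here.
 */
#include <stdbool.h>
#include <string.h>

#define PAGE_SIZE	4096u
#define GFP_ZERO	0x1u		/* placeholder for __GFP_ZERO */
#define PG_PREZEROED	0x1u		/* invented page flag */

struct page {
	unsigned int flags;
	unsigned char data[PAGE_SIZE];
};

/* Background worker: walk a batch of free pages while the CPU is
 * otherwise idle, zero them and mark them as pre-zeroed. */
void prezero_worker(struct page *free_pages, unsigned int nr)
{
	for (unsigned int i = 0; i < nr; i++) {
		if (free_pages[i].flags & PG_PREZEROED)
			continue;
		memset(free_pages[i].data, 0, PAGE_SIZE);
		free_pages[i].flags |= PG_PREZEROED;
	}
}

/* Allocation path: only pay for clearing if the page was not already
 * zeroed in the background (the flag is consumed either way). */
void prep_new_page(struct page *page, unsigned int gfp)
{
	bool was_zeroed = page->flags & PG_PREZEROED;

	page->flags &= ~PG_PREZEROED;
	if ((gfp & GFP_ZERO) && !was_zeroed)
		memset(page->data, 0, PAGE_SIZE);
}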