[RFC,0/4] mm: Add PG_zero support

Message ID 20200412090728.GA19572@open-light-1.localdomain (mailing list archive)

Message

Liang Li April 12, 2020, 9:07 a.m. UTC
Zeroing page contents usually happens at allocation time. This is a
time-consuming operation, and it makes pin and mlock operations very
slow, especially for a large batch of memory.

This series introduces a new feature that zeroes pages before they
are allocated, which can help speed up page allocation.

The idea is simple: zero free pages when the system is not busy and
mark each such page with PG_zero. When a page is allocated and needs
to be zero-filled, check the flag in its struct page; if the page is
marked PG_zero, zeroing can be skipped, which saves CPU time and
speeds up page allocation.
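
As a rough user-space illustration of the idea above (this is not code
from the series; struct sketch_page, sketch_prezero() and
sketch_alloc_zeroed() are hypothetical names, and clearing the flag on
allocation is an assumption), the flag check might look like this:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096

/* Hypothetical stand-in for struct page: one flag plus the page data. */
struct sketch_page {
	bool zeroed;			/* plays the role of PG_zero */
	unsigned char data[SKETCH_PAGE_SIZE];
};

/* Background step: zero a free page while the system is idle, then mark it. */
static void sketch_prezero(struct sketch_page *p)
{
	assert(p != NULL);
	memset(p->data, 0, sizeof(p->data));
	p->zeroed = true;
}

/*
 * Allocation step for a caller that wants zeroed memory: memset() is only
 * paid for when the page was not pre-zeroed. Returns true if the zeroing
 * was skipped.
 */
static bool sketch_alloc_zeroed(struct sketch_page *p)
{
	bool skipped;

	assert(p != NULL);
	skipped = p->zeroed;
	if (!skipped)
		memset(p->data, 0, sizeof(p->data));

	/* Assumption: the flag is cleared once the page has a new owner. */
	p->zeroed = false;
	return skipped;
}
```

The cost of memset() does not disappear in this sketch; it only moves
from the allocation path to whatever context calls sketch_prezero().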

This series is based on the 'free page reporting' feature introduced
by Alexander Duyck.

We can benefit from this feature in the following cases:
    1. User space mlocks a large chunk of memory
    2. VFIO pins pages for DMA
    3. Allocating transparent huge pages
    4. Speeding up the page fault path

My original intention in adding this feature was to shorten VM
creation time when a VFIO device is attached. It works well, and VM
creation time is reduced significantly.

Creating a VM [64G RAM, 32 CPUs] with GPU passthrough
=====================================================
QEMU uses 4K pages, THP is off
                  round1      round2      round3
w/o this patch:    23.5s       24.7s       24.6s
w/ this patch:     10.2s       10.3s       11.2s

QEMU uses 4K pages, THP is on
                  round1      round2      round3
w/o this patch:    17.9s       14.8s       14.9s
w/ this patch:      1.9s        1.8s        1.9s
=====================================================

I look forward to your feedback.

Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: David Hildenbrand <david@redhat.com>  
Cc: Michal Hocko <mhocko@kernel.org> 
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: liliangleo <liliangleo@didiglobal.com>

liliangleo (4):
  mm: reduce the impaction of page reporing worker
  mm: Add batch size for free page reporting
  mm: add sys fs configuration for page reporting
  mm: Add PG_zero support

 include/linux/highmem.h        |  31 ++++++-
 include/linux/page-flags.h     |  18 +++-
 include/trace/events/mmflags.h |   7 ++
 mm/Kconfig                     |  10 +++
 mm/Makefile                    |   1 +
 mm/huge_memory.c               |   3 +-
 mm/page_alloc.c                |   2 +
 mm/page_reporting.c            | 181 +++++++++++++++++++++++++++++++++++++++--
 mm/page_reporting.h            |  16 +++-
 mm/zero_page.c                 | 151 ++++++++++++++++++++++++++++++++++
 mm/zero_page.h                 |  13 +++
 11 files changed, 416 insertions(+), 17 deletions(-)
 create mode 100644 mm/zero_page.c
 create mode 100644 mm/zero_page.h

Comments

Dave Hansen April 13, 2020, 1:43 a.m. UTC | #1
On 4/12/20 2:07 AM, liliangleo wrote:
> Zero out the page content usually happens when allocating pages,
> this is a time consuming operation, it makes pin and mlock
> operation very slowly, especially for a large batch of memory.
> 
> This patch introduce a new feature for zero out pages before page
> allocation, it can help to speed up page allocation.

I think the bar for getting something like this merged is going to be
pretty high.  We have a long history of zeroing close to page use for
cache warmth reasons.  Starting up big VMs which won't soon touch the
memory they are allocating is basically the most pathological case
against our approach since they don't *care* about cache warmth.

I'm also not sure it's something we _want_ to optimize for.

VFIO's unconditional page pinning is the real problem here IMNHO.  They
don't *really* need to pin the memory.  We just don't have good
paravirtualized IOMMU support or want to pay the runtime cost for
pin/unpin operations.  You *could* totally have speedy VM startup if
only the pages being accessed or having DMA performed to them were
allocated.  But, the hacks that are in place mean that everything must
be pinned.
Alex Williamson April 13, 2020, 2:49 p.m. UTC | #2
On Sun, 12 Apr 2020 18:43:07 -0700
Dave Hansen <dave.hansen@intel.com> wrote:

> On 4/12/20 2:07 AM, liliangleo wrote:
> > Zero out the page content usually happens when allocating pages,
> > this is a time consuming operation, it makes pin and mlock
> > operation very slowly, especially for a large batch of memory.
> > 
> > This patch introduce a new feature for zero out pages before page
> > allocation, it can help to speed up page allocation.  
> 
> I think the bar for getting something like this merged is going to be
> pretty high.  We have a long history of zeroing close to page use for
> cache warmth reasons.  Starting up big VMs which won't soon touch the
> memory they are allocating is basically the most pathological case
> against our approach since they don't *care* about cache warmth.
> 
> I'm also not sure it's something we _want_ to optimize for.
> 
> VFIO's unconditional page pinning is the real problem here IMNHO.  They
> don't *really* need to pin the memory.  We just don't have good
> paravirtualized IOMMU support or want to pay the runtime cost for
> pin/unpin operations.  You *could* totally have speedy VM startup if
> only the pages being accessed or having DMA performed to them were
> allocated.  But, the hacks that are in place mean that everything must
> be pinned.

Maybe in an SEV or Secure Boot environment we can assume the VM guest
OS uses the IOMMU exclusively for DMA, but otherwise the IOMMU is
optional (at least for x86, other archs do require IOMMU support
afaik).  Therefore, how would we know which pages to pin when there are
only limited configs where we might be able to lean on the vIOMMU to
this extent?  Thanks,

Alex
Dave Hansen April 13, 2020, 3:14 p.m. UTC | #3
On 4/13/20 7:49 AM, Alex Williamson wrote:
>> VFIO's unconditional page pinning is the real problem here IMNHO.  They
>> don't *really* need to pin the memory.  We just don't have good
>> paravirtualized IOMMU support or want to pay the runtime cost for
>> pin/unpin operations.  You *could* totally have speedy VM startup if
>> only the pages being accessed or having DMA performed to them were
>> allocated.  But, the hacks that are in place mean that everything must
>> be pinned.
> Maybe in an SEV or Secure Boot environment we can assume the VM guest
> OS uses the IOMMU exclusively for DMA, but otherwise the IOMMU is
> optional (at least for x86, other archs do require IOMMU support
> afaik).  Therefore, how would we know which pages to pin when there are
> only limited configs where we might be able to lean on the vIOMMU to
> this extent?  Thanks,

You can delay pinning until the device is actually used.  That should be
late enough for the host to figure out whether a paravirtualized IOMMU
is in place.
Ashok Raj April 13, 2020, 3:25 p.m. UTC | #4
On Mon, Apr 13, 2020 at 08:14:32AM -0700, Dave Hansen wrote:
> On 4/13/20 7:49 AM, Alex Williamson wrote:
> >> VFIO's unconditional page pinning is the real problem here IMNHO.  They
> >> don't *really* need to pin the memory.  We just don't have good
> >> paravirtualized IOMMU support or want to pay the runtime cost for
> >> pin/unpin operations.  You *could* totally have speedy VM startup if
> >> only the pages being accessed or having DMA performed to them were
> >> allocated.  But, the hacks that are in place mean that everything must
> >> be pinned.
> > Maybe in an SEV or Secure Boot environment we can assume the VM guest
> > OS uses the IOMMU exclusively for DMA, but otherwise the IOMMU is
> > optional (at least for x86, other archs do require IOMMU support
> > afaik).  Therefore, how would we know which pages to pin when there are
> > only limited configs where we might be able to lean on the vIOMMU to
> > this extent?  Thanks,
> 
> You can delay pinning until the device is actually used.  That should be
> late enough for the host to figure out whether a paravirtualized IOMMU
> is in place.

When you have a device assigned to a guest, it is used as soon as the
guest starts probing the device. Some devices, like VFs, need DMA even
to probe and get resources assigned from the PF.

The only way we can do this is when the device supports ATS and PRS,
and the host IOMMU driver knows whether a fault needs to be handled by
the host (if the 2nd level is at fault) or by the guest (if the walk
in the first level isn't resolved).

2nd level faults need to be resolved by the VMM.
Alex Williamson April 13, 2020, 3:47 p.m. UTC | #5
On Mon, 13 Apr 2020 08:14:32 -0700
Dave Hansen <dave.hansen@intel.com> wrote:

> On 4/13/20 7:49 AM, Alex Williamson wrote:
> >> VFIO's unconditional page pinning is the real problem here IMNHO.  They
> >> don't *really* need to pin the memory.  We just don't have good
> >> paravirtualized IOMMU support or want to pay the runtime cost for
> >> pin/unpin operations.  You *could* totally have speedy VM startup if
> >> only the pages being accessed or having DMA performed to them were
> >> allocated.  But, the hacks that are in place mean that everything must
> >> be pinned.  
> > Maybe in an SEV or Secure Boot environment we can assume the VM guest
> > OS uses the IOMMU exclusively for DMA, but otherwise the IOMMU is
> > optional (at least for x86, other archs do require IOMMU support
> > afaik).  Therefore, how would we know which pages to pin when there are
> > only limited configs where we might be able to lean on the vIOMMU to
> > this extent?  Thanks,  
> 
> You can delay pinning until the device is actually used.  That should be
> late enough for the host to figure out whether a paravirtualized IOMMU
> is in place.

So the guest enables the bus master bit in the command register and at
that point we'd stall the VM for an indeterminate length of time while
we potentially pin all memory, and hope that both the user and the host
have the resources to account and allocate that memory, otherwise the
VM suddenly crashes?  All of this potentially taking place in the
pre-boot environment to support option ROMs as well.  A delay starting
the VM seems a lot more predictable.  Thanks,

Alex
Dave Hansen April 13, 2020, 4:43 p.m. UTC | #6
On 4/13/20 8:47 AM, Alex Williamson wrote:
>> You can delay pinning until the device is actually used.  That should be
>> late enough for the host to figure out whether a paravirtualized IOMMU
>> is in place.
> So the guest enables the bus master bit in the command register and at
> that point we'd stall the VM for an indeterminate length of time while
> we potentially pin all memory, and hope that both the user and the host
> has the resources to account and allocate that memory, otherwise the
> VM suddenly crashes?  All of this potentially taking place in the
> pre-boot environment to support option ROMs as well.  A delay starting
> the VM seems a lot more predictable.  Thanks,

BTW, there are a million ways to speed up VM startup without both
complicating the core VM *and* slowing down everybody that gets a
speedup from cache-hot pages coming out of the allocator.

Use ramfs or hugetlbfs files.  Have a bunch of them sitting around,
preallocated (and zeroed) and dole them out as VMs start up.

Instead of complicating the core VM, do the pre-zeroing in hugetlbfs.
Zeroing at the time the pages get added to the pool wouldn't be the
worst thing and wouldn't touch the core VM.
David Hildenbrand April 14, 2020, 12:01 p.m. UTC | #7
On 12.04.20 11:07, liliangleo wrote:
> Zero out the page content usually happens when allocating pages,
> this is a time consuming operation, it makes pin and mlock
> operation very slowly, especially for a large batch of memory.
> 
> This patch introduce a new feature for zero out pages before page
> allocation, it can help to speed up page allocation.
> 
> The idea is very simple, zero out free pages when the system is
> not busy and mark the page with PG_zero, when allocating a page,
> if the page need to be filled with zero, check the flag in the
> struct page, if it's marked as PG_zero, zero out can be skipped,
> it can save cpu time and speed up page allocation.
> 
> This serial is based on the feature 'free page reporting' which
> introduced by Alexander Duyck 
> 
> We can benefit from this feature in the flowing case:
>     1. User space mlock a large chunk of memory
>     2. VFIO pin pages for DMA
>     3. Allocating transparent huge page
>     4. Speed up page fault process
> 
> My original intention for adding this feature is to shorten
> VM creation time when VFIO device is attached, it works good 
> and the VM creation time is reduced obviously. 
> 
> Creating a VM [64G RAM, 32 CPUs] with GPU passthrough
> =====================================================
> QEMU use 4K pages, THP is off
>                   round1      round2      round3
> w/o this patch:    23.5s       24.7s       24.6s 
> w/ this patch:     10.2s       10.3s       11.2s
> 
> QEMU use 4K pages, THP is on
>                   round1      round2      round3
> w/o this patch:    17.9s       14.8s       14.9s 
> w/ this patch:     1.9s        1.8s        1.9s
> =====================================================
> 
> Look forward to your feedbacks.

I somehow have the feeling that this should not be glued to free page
reporting. After all, you are proposing your own status indicator for
each buddy page (PG_zero) already, which would mean you can build
something similar to free page reporting fairly easily, and have it
co-exist.

The free page reporting infrastructure is helpful when wanting to
asynchronously batch-process higher-order pages. I don't see the
immediate need for the "batch-processing" here.

E.g., why not simply zero out pages as they are freed/placed into free
lists? Especially, this is one of the simple alternatives to free page
reporting as we have it today (guest zeroes free pages, hypervisor
detects free pages using e.g., ksm).

That could even allow you to avoid the PG_zero flag completely. E.g.,
once the feature is activated and running, all pages in the buddy free
lists are zeroed out already. Zeroing happens synchronously from the
page-freeing thread, not when starting a guest.

Having said that, I agree with Dave here that there might be better
alternatives for this somewhat-special-case.
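
For illustration only, the zero-on-free alternative described above
could be sketched in user space as follows. The names
(struct sketch_free_page, sketch_free(), sketch_alloc()) are
hypothetical, and a single LIFO list stands in for the buddy free
lists:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096

/* Hypothetical free-list entry; not a kernel data structure. */
struct sketch_free_page {
	struct sketch_free_page *next;
	unsigned char data[SKETCH_PAGE_SIZE];
};

static struct sketch_free_page *sketch_free_list;

/* Free path: zero synchronously, in the context of the freeing thread. */
static void sketch_free(struct sketch_free_page *p)
{
	assert(p != NULL);
	memset(p->data, 0, sizeof(p->data));
	p->next = sketch_free_list;
	sketch_free_list = p;
}

/*
 * Allocation path: every page on the list is already zero, so there is
 * no per-page flag to check. Returns NULL when the list is empty.
 */
static struct sketch_free_page *sketch_alloc(void)
{
	struct sketch_free_page *p = sketch_free_list;

	if (p)
		sketch_free_list = p->next;
	return p;
}
```

The invariant "everything on the free list is zero" replaces the flag,
at the price of a memset() on every free.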
Alexander Duyck April 14, 2020, 3:07 p.m. UTC | #8
On Tue, Apr 14, 2020 at 5:01 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 12.04.20 11:07, liliangleo wrote:
> > Zero out the page content usually happens when allocating pages,
> > this is a time consuming operation, it makes pin and mlock
> > operation very slowly, especially for a large batch of memory.
> >
> > This patch introduce a new feature for zero out pages before page
> > allocation, it can help to speed up page allocation.
> >
> > The idea is very simple, zero out free pages when the system is
> > not busy and mark the page with PG_zero, when allocating a page,
> > if the page need to be filled with zero, check the flag in the
> > struct page, if it's marked as PG_zero, zero out can be skipped,
> > it can save cpu time and speed up page allocation.
> >
> > This serial is based on the feature 'free page reporting' which
> > introduced by Alexander Duyck
> >
> > We can benefit from this feature in the flowing case:
> >     1. User space mlock a large chunk of memory
> >     2. VFIO pin pages for DMA
> >     3. Allocating transparent huge page
> >     4. Speed up page fault process
> >
> > My original intention for adding this feature is to shorten
> > VM creation time when VFIO device is attached, it works good
> > and the VM creation time is reduced obviously.
> >
> > Creating a VM [64G RAM, 32 CPUs] with GPU passthrough
> > =====================================================
> > QEMU use 4K pages, THP is off
> >                   round1      round2      round3
> > w/o this patch:    23.5s       24.7s       24.6s
> > w/ this patch:     10.2s       10.3s       11.2s
> >
> > QEMU use 4K pages, THP is on
> >                   round1      round2      round3
> > w/o this patch:    17.9s       14.8s       14.9s
> > w/ this patch:     1.9s        1.8s        1.9s
> > =====================================================
> >
> > Look forward to your feedbacks.
>
> I somehow have the feeling that this should not be glued to free page
> reporting. After all, you are proposing your own status indicator for
> each buddy page (PG_zero) already, which would mean you can build
> something similar to free page reporting fairly easily, and have it
> co-exist.
>
> The free page reporting infrastructure is helpful when wanting to
> asynchronously batch-process higher-order pages. I don't see the
> immediate need for the "batch-processing here".
>
> E.g., why not simply zero out pages as they are freed/placed into free
> lists? Especially, this is one of the simple alternatives to free page
> reporting as we have it today (guest zeroes free pages, hypervisor
> detects free pages using e.g., ksm).

The problem with doing it at free is that it would be just as
expensive as doing it at allocation, only you would likely see it in
more cases, since many applications free all of their memory at once
on exit, while only a few pin all of their pages at the start.

> That could even allow you to avoid the PG_zero flag completely. E.g.,
> once the feature is activated and running, all pages in the buddy free
> lists are zeroed out already. Zeroing happens synchronously from the
> page-freeing thread, not when starting a guest.
>
> Having that said, I agree with Dave here, that there might be better
> alternatives for this somewhat-special-case.

I wonder if it wouldn't make more sense to look at the option of
splitting the initialization work up over multiple CPUs instead of
leaving it all single threaded. The data above was creating a VM with
64GB of RAM and 32 CPUs. How fast could we zero the pages if we were
performing the zeroing over those 32 CPUs? I wonder if we couldn't
look at recruiting other CPUs on the same node to perform the zeroing
like what Dan had originally proposed for ZONE_DEVICE initialization a
couple years ago[1].

Thanks.

- Alex

[1]: https://lore.kernel.org/linux-mm/153077336359.40830.13007326947037437465.stgit@dwillia2-desk3.amr.corp.intel.com/
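
To make the suggestion above concrete, here is a rough user-space
sketch of splitting one large zeroing job across several threads. It
is not taken from any kernel series: sketch_parallel_zero(),
struct zero_job and SKETCH_MAX_THREADS are hypothetical names, and
kernel code would use its own threading and NUMA placement rather
than pthreads.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_MAX_THREADS 64

/* One contiguous slice of the buffer, handled by one worker. */
struct zero_job {
	unsigned char *start;
	size_t len;
};

static void *zero_worker(void *arg)
{
	struct zero_job *job = arg;

	memset(job->start, 0, job->len);
	return NULL;
}

/*
 * Zero buf[0..len) using up to nthreads helper threads. The last slice
 * absorbs the remainder when len is not a multiple of nthreads.
 */
static void sketch_parallel_zero(unsigned char *buf, size_t len,
				 unsigned int nthreads)
{
	pthread_t tids[SKETCH_MAX_THREADS];
	struct zero_job jobs[SKETCH_MAX_THREADS];
	size_t chunk, off = 0;
	unsigned int i, started = 0;

	assert(nthreads >= 1 && nthreads <= SKETCH_MAX_THREADS);
	chunk = len / nthreads;

	for (i = 0; i < nthreads; i++) {
		jobs[i].start = buf + off;
		jobs[i].len = (i == nthreads - 1) ? len - off : chunk;
		off += jobs[i].len;

		if (pthread_create(&tids[i], NULL, zero_worker, &jobs[i]) != 0) {
			/* No helper available: zero the rest in the caller. */
			memset(jobs[i].start, 0,
			       len - (size_t)(jobs[i].start - buf));
			break;
		}
		started++;
	}

	/* Wait for every helper that was actually started. */
	for (i = 0; i < started; i++)
		pthread_join(tids[i], NULL);
}
```

Depending on the toolchain, this may need to be built with -pthread.
Whether the split helps in practice depends on memory bandwidth and on
where the helper threads run relative to the memory being zeroed.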
Daniel Jordan April 14, 2020, 3:40 p.m. UTC | #9
On Tue, Apr 14, 2020 at 08:07:32AM -0700, Alexander Duyck wrote:
> On Tue, Apr 14, 2020 at 5:01 AM David Hildenbrand <david@redhat.com> wrote:
> > Having that said, I agree with Dave here, that there might be better
> > alternatives for this somewhat-special-case.
> 
> I wonder if it wouldn't make more sense to look at the option of
> splitting the initialization work up over multiple CPUs instead of
> leaving it all single threaded. The data above was creating a VM with
> 64GB of RAM and 32 CPUs. How fast could we zero the pages if we were
> performing the zeroing over those 32 CPUs? I wonder if we couldn't
> look at recruiting other CPUs on the same node to perform the zeroing
> like what Dan had originally proposed for ZONE_DEVICE initialization a
> couple years ago[1].

This is exactly what I've done for VFIO.  Some performance results:

    https://lore.kernel.org/linux-mm/20181105165558.11698-10-daniel.m.jordan@oracle.com/

and a semi-current branch is here if anyone wants to test it:

  https://lore.kernel.org/linux-mm/20200212224731.kmss6o6agekkg3mw@ca-dmjordan1.us.oracle.com/

One of the issues with starting extra threads for paths triggered from
userspace such as VFIO is that they need to be properly throttled by relevant
resource controls such as cgroup (CPU controller especially) and
sched_setaffinity.  This type of control for kernel threads has another use
case too, async memcg reclaim.  All this is second on my list after I post a
series that multithreads deferred page init and sets up the basic
infrastructure for multithreading other paths, which I hope will be ready soon.

> [1]: https://lore.kernel.org/linux-mm/153077336359.40830.13007326947037437465.stgit@dwillia2-desk3.amr.corp.intel.com/

I haven't looked closely at memmap_init_zone, though I've tried
memmap_init_zone_device.  Will take a closer look to see how well this could be
incorporated.

Daniel
David Hildenbrand April 14, 2020, 3:44 p.m. UTC | #10
On 14.04.20 17:07, Alexander Duyck wrote:
> On Tue, Apr 14, 2020 at 5:01 AM David Hildenbrand <david@redhat.com> wrote:
>>
>> On 12.04.20 11:07, liliangleo wrote:
>>> Zero out the page content usually happens when allocating pages,
>>> this is a time consuming operation, it makes pin and mlock
>>> operation very slowly, especially for a large batch of memory.
>>>
>>> This patch introduce a new feature for zero out pages before page
>>> allocation, it can help to speed up page allocation.
>>>
>>> The idea is very simple, zero out free pages when the system is
>>> not busy and mark the page with PG_zero, when allocating a page,
>>> if the page need to be filled with zero, check the flag in the
>>> struct page, if it's marked as PG_zero, zero out can be skipped,
>>> it can save cpu time and speed up page allocation.
>>>
>>> This serial is based on the feature 'free page reporting' which
>>> introduced by Alexander Duyck
>>>
>>> We can benefit from this feature in the flowing case:
>>>     1. User space mlock a large chunk of memory
>>>     2. VFIO pin pages for DMA
>>>     3. Allocating transparent huge page
>>>     4. Speed up page fault process
>>>
>>> My original intention for adding this feature is to shorten
>>> VM creation time when VFIO device is attached, it works good
>>> and the VM creation time is reduced obviously.
>>>
>>> Creating a VM [64G RAM, 32 CPUs] with GPU passthrough
>>> =====================================================
>>> QEMU use 4K pages, THP is off
>>>                   round1      round2      round3
>>> w/o this patch:    23.5s       24.7s       24.6s
>>> w/ this patch:     10.2s       10.3s       11.2s
>>>
>>> QEMU use 4K pages, THP is on
>>>                   round1      round2      round3
>>> w/o this patch:    17.9s       14.8s       14.9s
>>> w/ this patch:     1.9s        1.8s        1.9s
>>> =====================================================
>>>
>>> Look forward to your feedbacks.
>>
>> I somehow have the feeling that this should not be glued to free page
>> reporting. After all, you are proposing your own status indicator for
>> each buddy page (PG_zero) already, which would mean you can build
>> something similar to free page reporting fairly easily, and have it
>> co-exist.
>>
>> The free page reporting infrastructure is helpful when wanting to
>> asynchronously batch-process higher-order pages. I don't see the
>> immediate need for the "batch-processing here".
>>
>> E.g., why not simply zero out pages as they are freed/placed into free
>> lists? Especially, this is one of the simple alternatives to free page
>> reporting as we have it today (guest zeroes free pages, hypervisor
>> detects free pages using e.g., ksm).
> 
> The problem with doing it at free is that it would be just as
> expensive as doing it at allocation, only you would likely see it in
> more cases as more applications are more likely to free all of their
> memory at once on exit, while only a few will pin all of their pages
> at the start.

If you want to have zeroed-out memory, you'll have to pay a price. So
the question is "when to do it" and "how to do it". This series proposes
to do it asynchronously from another thread.

> 
>> That could even allow you to avoid the PG_zero flag completely. E.g.,
>> once the feature is activated and running, all pages in the buddy free
>> lists are zeroed out already. Zeroing happens synchronously from the
>> page-freeing thread, not when starting a guest.
>>
>> Having that said, I agree with Dave here, that there might be better
>> alternatives for this somewhat-special-case.
> 
> I wonder if it wouldn't make more sense to look at the option of
> splitting the initialization work up over multiple CPUs instead of
> leaving it all single threaded. The data above was creating a VM with
> 64GB of RAM and 32 CPUs. How fast could we zero the pages if we were
> performing the zeroing over those 32 CPUs? I wonder if we couldn't
> look at recruiting other CPUs on the same node to perform the zeroing

Sounds interesting, especially at allocation time. Maybe possible in
combination with Dave's comment "Use ramfs or hugetlbfs files. Have a
bunch of them sitting around, preallocated (and zeroed).". IMHO
something like that makes more sense than doing it asynchronously from
another thread "slowing down everybody that gets a speedup from
cache-hot pages coming out of the allocator" (Dave's comment again :) )