[RFC] mm: align anon mmap for THP

Message ID	20190111201003.19755-1-mike.kravetz@oracle.com (mailing list archive)
State	New, archived
Headers
Return-Path: <owner-linux-mm@kvack.org>
Received-SPF: pass (google.com: domain of mike.kravetz@oracle.com designates 141.146.126.79 as permitted sender) client-ip=141.146.126.79;
From: Mike Kravetz <mike.kravetz@oracle.com>
To: linux-mm@kvack.org, linux-kernel@vger.kernel.org
Cc: Hugh Dickins <hughd@google.com>, "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>, Michal Hocko <mhocko@kernel.org>, Dan Williams <dan.j.williams@intel.com>, Matthew Wilcox <willy@infradead.org>, Toshi Kani <toshi.kani@hpe.com>, Boaz Harrosh <boazh@netapp.com>, Andrew Morton <akpm@linux-foundation.org>, Mike Kravetz <mike.kravetz@oracle.com>
Subject: [RFC PATCH] mm: align anon mmap for THP
Date: Fri, 11 Jan 2019 12:10:03 -0800
Message-Id: <20190111201003.19755-1-mike.kravetz@oracle.com>
Sender: owner-linux-mm@kvack.org
Precedence: bulk

Mike Kravetz Jan. 11, 2019, 8:10 p.m. UTC

At LPC last year, Boaz Harrosh asked why he had to 'jump through hoops'
to get an address returned by mmap() suitably aligned for THP.  It seems
that if mmap is asking for a mapping length greater than huge page
size, it should align the returned address to huge page size.

THP alignment has already been added for DAX, shm and tmpfs.  However,
simple anon mappings do not take THP alignment into account.

I could not determine if this was ever considered or discussed in the past.

There is a maze of arch specific and independent get_unmapped_area
routines.  The patch below just modifies the common vm_unmapped_area
routine.  It may be too simplistic, but I wanted to throw out some
code while asking if something like this has ever been considered.

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 include/linux/huge_mm.h |  6 ++++++
 include/linux/mm.h      |  3 +++
 mm/mmap.c               | 11 +++++++++++
 3 files changed, 20 insertions(+)

kirill.shutemov@linux.intel.com Jan. 11, 2019, 9:55 p.m. UTC | #1

On Fri, Jan 11, 2019 at 08:10:03PM +0000, Mike Kravetz wrote:
> At LPC last year, Boaz Harrosh asked why he had to 'jump through hoops'
> to get an address returned by mmap() suitably aligned for THP.  It seems
> that if mmap is asking for a mapping length greater than huge page
> size, it should align the returned address to huge page size.
> 
> THP alignment has already been added for DAX, shm and tmpfs.  However,
> simple anon mappings does not take THP alignment into account.

In the general case, when no hint address is provided, all anonymous memory
requests tend to clump into a single bigger VMA, which gives you a
better chance of getting THP even if a single allocation is too small.
This patch will *reduce* that effect, and I guess the net result will be
negative.

The patch also effectively reduces the bits available for ASLR and increases
address space fragmentation (it increases the number of VMAs and therefore
page fault cost).

I think any change in this direction has to be way more data driven.

Mike Kravetz Jan. 11, 2019, 11:28 p.m. UTC | #2

On 1/11/19 1:55 PM, Kirill A. Shutemov wrote:
> On Fri, Jan 11, 2019 at 08:10:03PM +0000, Mike Kravetz wrote:
>> At LPC last year, Boaz Harrosh asked why he had to 'jump through hoops'
>> to get an address returned by mmap() suitably aligned for THP.  It seems
>> that if mmap is asking for a mapping length greater than huge page
>> size, it should align the returned address to huge page size.
>>
>> THP alignment has already been added for DAX, shm and tmpfs.  However,
>> simple anon mappings does not take THP alignment into account.
> 
> In general case, when no hint address provided, all anonymous memory
> requests have tendency to clamp into a single bigger VMA and get you
> better chance having THP, even if a single allocation is too small.
> This patch will *reduce* the effect and I guess the net result will be
> net negative.

Ah!  I forgot about combining like mappings into a single vma.  Increasing
alignment could/would prevent this.

> The patch also effectively reduces bit available for ASLR and increases
> address space fragmentation (increases number of VMA and therefore page
> fault cost).
> 
> I think any change in this direction has to be way more data driven.

Ok, I just wanted to ask the question.  I've seen application code doing
the 'mmap sufficiently large area' then unmap to get desired alignment
trick.  Was wondering if there was something we could do to help.

Thanks
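
The 'mmap sufficiently large area, then unmap' trick mentioned above could
look roughly like the sketch below.  It is only an illustration, not code
from any particular application: the helper name mmap_aligned_trim() is
hypothetical, and it assumes that size is a multiple of the page size and
that align is a power of two no smaller than a page.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: anonymous mapping of `size` bytes aligned to `align`. */
void *mmap_aligned_trim(size_t size, size_t align)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);

    /* Assumptions of this sketch, not requirements of mmap() itself. */
    assert(align >= page && (align & (align - 1)) == 0);
    assert(size % page == 0);

    /* Over-allocate so an aligned start is guaranteed to fit inside. */
    size_t span = size + (align - page);
    uint8_t *raw = mmap(NULL, span, PROT_READ | PROT_WRITE,
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (raw == MAP_FAILED)
        return NULL;

    /* Round the start up to the requested alignment. */
    uint8_t *start = (uint8_t *)(((uintptr_t)raw + align - 1) &
                                 ~((uintptr_t)align - 1));
    size_t head = (size_t)(start - raw);
    size_t tail = span - head - size;

    /* Give the unused head and tail back; this splits the VMA. */
    if (head)
        munmap(raw, head);
    if (tail)
        munmap(start + size, tail);
    return start;
}
```

The two munmap() calls are the 'additional system call or two' that the
trick costs compared with a plain mmap().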

Kirill A . Shutemov Jan. 14, 2019, 1:50 p.m. UTC | #3

On Fri, Jan 11, 2019 at 03:28:37PM -0800, Mike Kravetz wrote:
> Ok, I just wanted to ask the question.  I've seen application code doing
> the 'mmap sufficiently large area' then unmap to get desired alignment
> trick.  Was wondering if there was something we could do to help.

An application may want an aligned allocation for different reasons.
It should be okay for userspace to ask for size + (alignment - PAGE_SIZE)
and then round up the address to get the alignment. We basically do the
same on the kernel side.

For THP, I believe, the kernel already does The Right Thing™ for most users.
A user may still want to get a specific range as THP (to avoid false sharing or
something). But I still believe userspace has all the required tools to get it
right.
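
The size + (alignment - PAGE_SIZE) approach described above can be sketched
as follows.  This is a minimal illustration under stated assumptions: the
helper name mmap_aligned_keep() is hypothetical, align is assumed to be a
power of two no smaller than a page, and the slack around the aligned range
is simply left mapped, so a caller that later wants to munmap() everything
would have to remember the raw address and span itself.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: over-allocate, then only round the pointer up. */
void *mmap_aligned_keep(size_t size, size_t align)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);

    /* Assumption of this sketch: power-of-two alignment, at least a page. */
    assert(align >= page && (align & (align - 1)) == 0);

    /* size + (alignment - PAGE_SIZE), as suggested in the text above. */
    size_t span = size + (align - page);
    void *raw = mmap(NULL, span, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (raw == MAP_FAILED)
        return NULL;

    /* No munmap(): the VMA stays in one piece, the slack is never touched. */
    return (void *)(((uintptr_t)raw + align - 1) & ~((uintptr_t)align - 1));
}
```

Untouched slack pages cost virtual address space only, since anonymous
memory is not populated until it is first accessed.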

Steven Sistare Jan. 14, 2019, 3:35 p.m. UTC | #4

On 1/11/2019 6:28 PM, Mike Kravetz wrote:
> On 1/11/19 1:55 PM, Kirill A. Shutemov wrote:
>> On Fri, Jan 11, 2019 at 08:10:03PM +0000, Mike Kravetz wrote:
>>> At LPC last year, Boaz Harrosh asked why he had to 'jump through hoops'
>>> to get an address returned by mmap() suitably aligned for THP.  It seems
>>> that if mmap is asking for a mapping length greater than huge page
>>> size, it should align the returned address to huge page size.

A better heuristic would be to return an aligned address if the length
is a multiple of the huge page size.  The gap (if any) between the end of
the previous VMA and the start of this VMA would be filled by subsequent
smaller mmap requests.  The new behavior would need to become part of the
mmap interface definition so apps can rely on it and omit their hoop-jumping
code.

Personally I would like to see a new MAP_ALIGN flag and treat the addr
argument as the alignment (like Solaris), but I am told that adding flags
is problematic because old kernels accept undefined flag bits from userland
without complaint, so their behavior would change.

- Steve

>>> THP alignment has already been added for DAX, shm and tmpfs.  However,
>>> simple anon mappings does not take THP alignment into account.
>>
>> In general case, when no hint address provided, all anonymous memory
>> requests have tendency to clamp into a single bigger VMA and get you
>> better chance having THP, even if a single allocation is too small.
>> This patch will *reduce* the effect and I guess the net result will be
>> net negative.
> 
> Ah!  I forgot about combining like mappings into a single vma.  Increasing
> alignment could/would prevent this.
> 
>> The patch also effectively reduces bit available for ASLR and increases
>> address space fragmentation (increases number of VMA and therefore page
>> fault cost).
>>
>> I think any change in this direction has to be way more data driven.
> 
> Ok, I just wanted to ask the question.  I've seen application code doing
> the 'mmap sufficiently large area' then unmap to get desired alignment
> trick.  Was wondering if there was something we could do to help.
> 
> Thanks
>

Harrosh, Boaz Jan. 14, 2019, 4:29 p.m. UTC | #5

Kirill A. Shutemov <kirill@shutemov.name> wrote:
> On Fri, Jan 11, 2019 at 03:28:37PM -0800, Mike Kravetz wrote:
>> Ok, I just wanted to ask the question.  I've seen application code doing
>> the 'mmap sufficiently large area' then unmap to get desired alignment
>> trick.  Was wondering if there was something we could do to help.
>
> Application may want to get aligned allocation for different reasons.
> It should be okay for userspace to ask for size + (alignment - PAGE_SIZE)
> and then round up the address to get the alignment. We basically do the
> same on kernel side.
>

This is what we do and will need to keep doing for old kernels.
But it is a pity that those holes cannot be reused for small maps and, most importantly,
that we cannot have "mapping holes" around the mapping that catch memory
overruns.

> For THP, I believe, kernel already does The Right Thing™ for most users.
> User still may want to get specific range as THP (to avoid false sharing or
> something).

I'm an OK Kernel programmer.  But I was not able to create a HugePage mapping
against /dev/shm/ in a reliable way. I think it only worked on Fedora 28/29
but not on any other distro/version. (MMAP_HUGE)

We run with our own compiled kernel on various distros. THP is configured
in, but mmap against /dev/shm/ never gives me huge pages. Does it only
work with anonymous mmap? (I think it is mount dependent, which is not
under the application's control.)

Just a rant. One day I will figure this out. Meanwhile I do this ugly
user-mode alignment of the pointers, and try to sleep at night ...

> But still I believe userspace has all required tools to get it
> right.
>

I still wish that if I ask for an mmap size aligned on 2M, I would automatically
get a 2M-aligned pointer. I don't see how the system can benefit from having both ends
of the VMA cross a huge page boundary.

> --
> Kirill A. Shutemov

Thanks
Boaz

Michal Hocko Jan. 14, 2019, 4:40 p.m. UTC | #6

On Mon 14-01-19 16:29:29, Harrosh, Boaz wrote:
>  Kirill A. Shutemov <kirill@shutemov.name> wrote:
> > On Fri, Jan 11, 2019 at 03:28:37PM -0800, Mike Kravetz wrote:
> >> Ok, I just wanted to ask the question.  I've seen application code doing
> >> the 'mmap sufficiently large area' then unmap to get desired alignment
> >> trick.  Was wondering if there was something we could do to help.
> >
> > Application may want to get aligned allocation for different reasons.
> > It should be okay for userspace to ask for size + (alignment - PAGE_SIZE)
> > and then round up the address to get the alignment. We basically do the
> > same on kernel side.
> >
> 
> This is what we do and will need to keep doing for old Kernels.
> But it is a pity that those holes can not be reused for small maps, and most important
> that we cannot have "mapping holes" around the mapping that catch memory
> overruns

What prevents you from mapping a larger area and using MAP_FIXED,
PROT_NONE mappings over it to get the protection?
 
> > For THP, I believe, kernel already does The Right Thing™ for most users.
> > User still may want to get specific range as THP (to avoid false sharing or
> > something).
> 
> I'm an OK Kernel programmer.  But I was not able to create a HugePage mapping
> against /dev/shm/ in a reliable way. I think it only worked on Fedora 28/29
> but not on any other distro/version. (MMAP_HUGE)

Are you mixing up hugetlb with THP?

> We run with our own compiled Kernel on various distros, THP is configured
> in but mmap against /dev/shm/ never gives me Huge pages. Does it only
> work with anonymous mmap ? (I think it is mount dependent which is not
> in the application control)

If you are talking about THP then you have to enable huge pages for the
mapping AFAIR.
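
One way to read the "larger area plus MAP_FIXED, PROT_NONE" suggestion is
sketched below.  This is an illustration only: it reserves the whole span
as PROT_NONE and then places the usable mapping inside it with MAP_FIXED,
which is one of several possible orderings.  The helper name
mmap_aligned_guarded() is hypothetical, size is assumed to be a multiple of
the page size, and align a power of two no smaller than a page.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: aligned mapping with inaccessible pages around it. */
void *mmap_aligned_guarded(size_t size, size_t align)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);

    /* Assumptions of this sketch. */
    assert(align >= page && (align & (align - 1)) == 0);
    assert(size % page == 0);

    /* Room for alignment slack plus at least one guard page on each side. */
    size_t span = size + align + page;
    uint8_t *raw = mmap(NULL, span, PROT_NONE,
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (raw == MAP_FAILED)
        return NULL;

    /* First aligned address that leaves a full guard page in front. */
    uintptr_t start = ((uintptr_t)raw + page + align - 1) &
                      ~((uintptr_t)align - 1);

    /* Replace the interior of the reservation with a usable mapping. */
    void *p = mmap((void *)start, size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
    if (p == MAP_FAILED) {
        munmap(raw, span);
        return NULL;
    }
    return p;
}
```

Accesses just before or just after the returned range hit PROT_NONE pages
and fault, which gives the overrun detection asked about above.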

Harrosh, Boaz Jan. 14, 2019, 4:40 p.m. UTC | #7

Sistare <steven.sistare@oracle.com> wrote:
> 
> A better heuristic would be to return an aligned address if the length
> is a multiple of the huge page size.  The gap (if any) between the end of
> the previous VMA and the start of this VMA would be filled by subsequent
> smaller mmap requests.  The new behavior would need to become part of the
> mmap interface definition so apps can rely on it and omit their hoop-jumping
> code.
> 

Yes that was my original request

> Personally I would like to see a new MAP_ALIGN flag and treat the addr
> argument as the alignment (like Solaris), 

Yes I would like that. So app can know when to do the old thing ...

> but I am told that adding flags
> is problematic because old kernels accept undefined flag bits from userland
> without complaint, so their behavior would change.
> 

There is already a mechanism in place, since 4.14 I think or even before, for
adding new MAP_XXX flags. This is done by combining the MAP_SHARED & MAP_PRIVATE
flags together with the new set of flags. If new flags are present, the combination is allowed and means
some new flag is being requested. Otherwise, the combination is not allowed in POSIX
and would fail in old kernels.

Cheers
Boaz

> - Steve

Harrosh, Boaz Jan. 14, 2019, 4:54 p.m. UTC | #8

Michal Hocko <mhocko@kernel.org> wrote:

<>
> What does prevent you from mapping a larger area and MAP_FIXED,
> PROT_NONE over it to get the protection?

Yes Thanks I will try. That's good.

>> > For THP, I believe, kernel already does The Right Thing™ for most users.
>> > User still may want to get specific range as THP (to avoid false sharing or
>> > something).
>>
>> I'm an OK Kernel programmer.  But I was not able to create a HugePage mapping
>> against /dev/shm/ in a reliable way. I think it only worked on Fedora 28/29
>> but not on any other distro/version. (MMAP_HUGE)
>
> Are you mixing hugetlb rather than THP?

Probably. I was looking for the easiest way to get my mmap based memory allocations
to be 2M based instead of 4k, to get better IO characteristics across the kernel.
But I kept getting the 4k pointers. (Can't really remember all the things I tried.)

>> We run with our own compiled Kernel on various distros, THP is configured
>> in but mmap against /dev/shm/ never gives me Huge pages. Does it only
>> work with anonymous mmap ? (I think it is mount dependent which is not
>> in the application control)
>
> If you are talking about THP then you have to enable huge pages for the
> mapping AFAIR.

This is exactly what I was looking to achieve but was not able to do. Most probably
a stupid omission on my part, but it just shows that there is no trivial, straight
out-of-the-man-page way to do it.  (Would love a code snippet if you ever wrote one?)

> --
> Michal Hocko
> SUSE Labs

Thanks man
Boaz

Michal Hocko Jan. 14, 2019, 6:02 p.m. UTC | #9

On Mon 14-01-19 16:54:02, Harrosh, Boaz wrote:
> Michal Hocko <mhocko@kernel.org> wrote:
[...]
> >> We run with our own compiled Kernel on various distros, THP is configured
> >> in but mmap against /dev/shm/ never gives me Huge pages. Does it only
> >> work with anonymous mmap ? (I think it is mount dependent which is not
> >> in the application control)
> >
> > If you are talking about THP then you have to enable huge pages for the
> > mapping AFAIR.
> 
> This is exactly what I was looking to achieve but was not able to do. Most probably
> a stupid omission on my part, but just to show that it is not that trivial and straight
> out-of-the-man-page way to do it.  (Would love a code snippet if you ever wrote one?)

Have you tried
mount -t tmpfs -o huge=always none $MNT_POINT ?

It is true that the man pages are silent about this, but at least Documentation/admin-guide/mm/transhuge.rst
has some information. Time to send a patch to the man pages, I would say.

Mike Kravetz Jan. 14, 2019, 6:54 p.m. UTC | #10

On 1/14/19 7:35 AM, Steven Sistare wrote:
> On 1/11/2019 6:28 PM, Mike Kravetz wrote:
>> On 1/11/19 1:55 PM, Kirill A. Shutemov wrote:
>>> On Fri, Jan 11, 2019 at 08:10:03PM +0000, Mike Kravetz wrote:
>>>> At LPC last year, Boaz Harrosh asked why he had to 'jump through hoops'
>>>> to get an address returned by mmap() suitably aligned for THP.  It seems
>>>> that if mmap is asking for a mapping length greater than huge page
>>>> size, it should align the returned address to huge page size.
> 
> A better heuristic would be to return an aligned address if the length
> is a multiple of the huge page size.  The gap (if any) between the end of
> the previous VMA and the start of this VMA would be filled by subsequent
> smaller mmap requests.  The new behavior would need to become part of the
> mmap interface definition so apps can rely on it and omit their hoop-jumping
> code.

Yes, the heuristic really should be 'length is a multiple of the huge page
size'.  As you mention, this would still leave gaps.  I need to look closer
but this may not be any worse than the trick of mapping an area with rounded
up length and then unmapping pages at the beginning.

When I sent this out, the thought in the back of my mind was that this doesn't
really matter unless there is some type of alignment guarantee.  Otherwise,
user space code needs to continue employing its own code to check/force alignment.
Making matters somewhat worse is that I do not believe there is a C interface to
query huge page size.  I thought there was discussion about adding one, but I
cannot find it.

> Personally I would like to see a new MAP_ALIGN flag and treat the addr
> argument as the alignment (like Solaris), but I am told that adding flags
> is problematic because old kernels accept undefined flag bits from userland
> without complaint, so their behavior would change.

Well, a flag would clearly define desired behavior.

As others have mentioned, there are mechanisms in place that allow user
space code to get the alignment it wants.  However, it is at the expense of
an additional system call or two.  Perhaps the question is, "Is it worth
defining new behavior to eliminate this overhead?".

One other thing to consider is that at mmap time, we likely do not know if
the vma will/can use THP.  We would know if system wide THP configuration
is set to never or always.  However, I 'think' the default for most distros
is madvise.  Therefore, it is not until a subsequent madvise call that we
know THP will be employed.  If the application code will need to make this
separate madvise call, then perhaps it is not too much to expect that it
take explicit action to optimally align the mapping.
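
For illustration, the separate steps an application takes in the madvise
case might look like the sketch below.  It is not a definitive recipe: the
helper names thp_pmd_size() and request_thp() are hypothetical, and the
sysfs file read here is the one described in
Documentation/admin-guide/mm/transhuge.rst, which may be absent when THP
is not configured in.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <sys/mman.h>

/*
 * Hypothetical helper: read the PMD-level THP size from sysfs.
 * Returns 0 if the file is missing or unreadable.
 */
size_t thp_pmd_size(void)
{
    size_t size = 0;
    FILE *f = fopen("/sys/kernel/mm/transparent_hugepage/hpage_pmd_size",
                    "r");

    if (f) {
        if (fscanf(f, "%zu", &size) != 1)
            size = 0;
        fclose(f);
    }
    return size;
}

/*
 * Hypothetical helper: ask for THP on an already-mapped range.
 * Only at this point does the kernel learn that the range should use THP
 * when the system-wide setting is "madvise".  Returns 0 on success, -1 with
 * errno set otherwise (for example when THP is not built in).
 */
int request_thp(void *addr, size_t len)
{
    assert(addr != NULL && len > 0);
    return madvise(addr, len, MADV_HUGEPAGE);
}
```

An application would typically align its mapping to thp_pmd_size() first
and then call request_thp() on the aligned range.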

Steven Sistare Jan. 14, 2019, 7:26 p.m. UTC | #11

On 1/14/2019 1:54 PM, Mike Kravetz wrote:
> On 1/14/19 7:35 AM, Steven Sistare wrote:
>> On 1/11/2019 6:28 PM, Mike Kravetz wrote:
>>> On 1/11/19 1:55 PM, Kirill A. Shutemov wrote:
>>>> On Fri, Jan 11, 2019 at 08:10:03PM +0000, Mike Kravetz wrote:
>>>>> At LPC last year, Boaz Harrosh asked why he had to 'jump through hoops'
>>>>> to get an address returned by mmap() suitably aligned for THP.  It seems
>>>>> that if mmap is asking for a mapping length greater than huge page
>>>>> size, it should align the returned address to huge page size.
>>
>> A better heuristic would be to return an aligned address if the length
>> is a multiple of the huge page size.  The gap (if any) between the end of
>> the previous VMA and the start of this VMA would be filled by subsequent
>> smaller mmap requests.  The new behavior would need to become part of the
>> mmap interface definition so apps can rely on it and omit their hoop-jumping
>> code.
> 
> Yes, the heuristic really should be 'length is a multiple of the huge page
> size'.  As you mention, this would still leave gaps.  I need to look closer
> but this may not be any worse than the trick of mapping an area with rounded
> up length and then unmapping pages at the beginning.
> 
> When I sent this out, the thought in the back of my mind was that this doesn't
> really matter unless there is some type of alignment guarantee.  Otherwise,
> user space code needs continue employing their code to check/force alignment.
> Making matters somewhat worse is that I do not believe there is C interface to
> query huge page size.  I thought there was discussion about adding one, but I
> can not find it.

Right. Solaris provides getpagesizes().

>> Personally I would like to see a new MAP_ALIGN flag and treat the addr
>> argument as the alignment (like Solaris), but I am told that adding flags
>> is problematic because old kernels accept undefined flag bits from userland
>> without complaint, so their behavior would change.
> 
> Well, a flag would clearly define desired behavior.
> 
> As others have been mentioned, there are mechanisms in place that allow user
> space code to get the alignment it wants.  However, it is at the expense of
> an additional system call or two.  Perhaps the question is, "Is it worth
> defining new behavior to eliminate this overhead?".
> 
> One other thing to consider is that at mmap time, we likely do not know if
> the vma will/can use THP.  We would know if system wide THP configuration
> is set to never or always.  However, I 'think' the default for most distros
> is madvise.  Therefore, it is not until a subsequent madvise call that we
> know THP will be employed.  If the application code will need to make this
> separate madvise call, then perhaps it is not too much to expect that it
> take explicit action to optimally align the mapping.

True.  It is annoying to write the extra code, but the power user will do it.

The heuristic alignment would primarily benefit applications that are not as
carefully optimized.

- Steve

Kirill A . Shutemov Jan. 15, 2019, 8:24 a.m. UTC | #12

On Mon, Jan 14, 2019 at 10:54:45AM -0800, Mike Kravetz wrote:
> On 1/14/19 7:35 AM, Steven Sistare wrote:
> > On 1/11/2019 6:28 PM, Mike Kravetz wrote:
> >> On 1/11/19 1:55 PM, Kirill A. Shutemov wrote:
> >>> On Fri, Jan 11, 2019 at 08:10:03PM +0000, Mike Kravetz wrote:
> >>>> At LPC last year, Boaz Harrosh asked why he had to 'jump through hoops'
> >>>> to get an address returned by mmap() suitably aligned for THP.  It seems
> >>>> that if mmap is asking for a mapping length greater than huge page
> >>>> size, it should align the returned address to huge page size.
> > 
> > A better heuristic would be to return an aligned address if the length
> > is a multiple of the huge page size.  The gap (if any) between the end of
> > the previous VMA and the start of this VMA would be filled by subsequent
> > smaller mmap requests.  The new behavior would need to become part of the
> > mmap interface definition so apps can rely on it and omit their hoop-jumping
> > code.
> 
> Yes, the heuristic really should be 'length is a multiple of the huge page
> size'.  As you mention, this would still leave gaps.  I need to look closer
> but this may not be any worse than the trick of mapping an area with rounded
> up length and then unmapping pages at the beginning.

The question is why it is any better. Virtual address space is generally
cheap; an additional VMA may be more significant due to find_vma() overhead.

And you don't *need* to unmap anything. Just use the aligned pointer.

> 
> When I sent this out, the thought in the back of my mind was that this doesn't
> really matter unless there is some type of alignment guarantee.  Otherwise,
> user space code needs continue employing their code to check/force alignment.
> Making matters somewhat worse is that I do not believe there is C interface to
> query huge page size.  I thought there was discussion about adding one, but I
> can not find it.

We have posix_memalign(3).
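
A minimal sketch of the posix_memalign(3) route follows.  The wrapper name
alloc_aligned() is hypothetical, and the caller is assumed to know the
alignment it wants (for example 2M on x86-64); posix_memalign() itself
only aligns the allocation and does not report the huge page size.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical wrapper: `size` bytes aligned to `align`, or NULL on error. */
void *alloc_aligned(size_t size, size_t align)
{
    void *p = NULL;

    /* posix_memalign() needs a power-of-two multiple of sizeof(void *). */
    assert(align >= sizeof(void *) && (align & (align - 1)) == 0);

    if (posix_memalign(&p, align, size) != 0)
        return NULL;
    return p;   /* release with free() */
}
```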

Mike Kravetz Jan. 15, 2019, 6:08 p.m. UTC | #13

On 1/15/19 12:24 AM, Kirill A. Shutemov wrote:
> On Mon, Jan 14, 2019 at 10:54:45AM -0800, Mike Kravetz wrote:
>> On 1/14/19 7:35 AM, Steven Sistare wrote:
>>> On 1/11/2019 6:28 PM, Mike Kravetz wrote:
>>>> On 1/11/19 1:55 PM, Kirill A. Shutemov wrote:
>>>>> On Fri, Jan 11, 2019 at 08:10:03PM +0000, Mike Kravetz wrote:
>>>>>> At LPC last year, Boaz Harrosh asked why he had to 'jump through hoops'
>>>>>> to get an address returned by mmap() suitably aligned for THP.  It seems
>>>>>> that if mmap is asking for a mapping length greater than huge page
>>>>>> size, it should align the returned address to huge page size.
>>>
>>> A better heuristic would be to return an aligned address if the length
>>> is a multiple of the huge page size.  The gap (if any) between the end of
>>> the previous VMA and the start of this VMA would be filled by subsequent
>>> smaller mmap requests.  The new behavior would need to become part of the
>>> mmap interface definition so apps can rely on it and omit their hoop-jumping
>>> code.
>>
>> Yes, the heuristic really should be 'length is a multiple of the huge page
>> size'.  As you mention, this would still leave gaps.  I need to look closer
>> but this may not be any worse than the trick of mapping an area with rounded
>> up length and then unmapping pages at the beginning.
> 
> The question why is it any better. Virtual address space is generally
> cheap, additional VMA maybe more significant due to find_vma() overhead.
> 
> And you don't *need* to unmap anything. Just use aligned pointer.

You are correct, it is not any better.

I know you do not need to unmap anything.  However, I believe people are
writing code which does this today.  For example, qemu's qemu_ram_mmap()
utility routine does this, but it may have other reasons for creating
the gap.

Thanks for all of the feedback.  I do not think there is anything we can
or should do in this area.  As Steve said, 'power users' who want to get
optimal THP usage will write the code to make that happen.
