
[V2] mm/page_alloc: Add alloc_contig_pages()

Message ID 1571223765-10662-1-git-send-email-anshuman.khandual@arm.com (mailing list archive)
State New, archived
Series [V2] mm/page_alloc: Add alloc_contig_pages()

Commit Message

Anshuman Khandual Oct. 16, 2019, 11:02 a.m. UTC
HugeTLB helper alloc_gigantic_page() implements a fairly generic allocation
method: it scans over various zones looking for a large contiguous pfn range
before trying to allocate it with alloc_contig_range(). Other than deriving
the requested order from 'struct hstate', there is nothing HugeTLB specific
in there. This can be made available for general use, to allocate contiguous
memory which could not be allocated through the buddy allocator.

alloc_gigantic_page() has been split up, carving out the actual allocation
method, which is now made available via the new alloc_contig_pages() helper
wrapped under CONFIG_CONTIG_ALLOC. All references to 'gigantic' have been
replaced with the more generic term 'contig'. Pages allocated here should be
freed with free_contig_range() or by calling __free_page() on each allocated
page.
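
For illustration, a minimal caller could look like the sketch below. This is
a hedged example: grab_buffer(), drop_buffer() and the request size of 512
pages are made up; only alloc_contig_pages() and free_contig_range() come
from this patch.

#include <linux/gfp.h>
#include <linux/topology.h>     /* numa_mem_id() */

/* Hypothetical user: allocate 512 physically contiguous pages near the
 * local node, then release them again. */
static struct page *grab_buffer(void)
{
        /* NULL nodemask: no restriction beyond the preferred node. */
        return alloc_contig_pages(512, GFP_KERNEL, numa_mem_id(), NULL);
}

static void drop_buffer(struct page *page)
{
        if (page)
                free_contig_range(page_to_pfn(page), 512);
}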

Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Michal Hocko <mhocko@suse.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Pavel Tatashin <pavel.tatashin@microsoft.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: linux-kernel@vger.kernel.org
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
---
This is based on https://patchwork.kernel.org/patch/11190213/

Changes in V2:

- Rephrased patch subject per David
- Fixed all typos per David
- s/order/contiguous

Changes from [V5,1/2] mm/hugetlb: Make alloc_gigantic_page()...

- alloc_contig_pages() takes nr_pages instead of order per Michal
- s/gigantic/contig on all related functions

 include/linux/gfp.h |  2 +
 mm/hugetlb.c        | 77 +----------------------------------
 mm/page_alloc.c     | 97 +++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 101 insertions(+), 75 deletions(-)

Comments

David Hildenbrand Oct. 16, 2019, 11:09 a.m. UTC | #1
On 16.10.19 13:02, Anshuman Khandual wrote:
> [...]
> ---
> This is based on https://patchwork.kernel.org/patch/11190213/
> 
> Changes in V2:
> 
> - Rephrased patch subject per David
> - Fixed all typos per David
> - s/order/contiguous

Just to make sure, you ignored my comment regarding alignment although I 
explicitly mentioned it a second time? Thanks.

> [...]
Anshuman Khandual Oct. 16, 2019, 11:13 a.m. UTC | #2
On 10/16/2019 04:39 PM, David Hildenbrand wrote:
> On 16.10.19 13:02, Anshuman Khandual wrote:
>> [...]
> 
> Just to make sure, you ignored my comment regarding alignment although I explicitly mentioned it a second time? Thanks.

I had explicitly asked Michal what should be included in the respin. Anyway,
it seems like the previous thread is active again. I am happy to incorporate
anything new that gets agreed on there.

- Anshuman
Michal Hocko Oct. 16, 2019, 11:51 a.m. UTC | #3
On Wed 16-10-19 16:43:57, Anshuman Khandual wrote:
> 
> 
> On 10/16/2019 04:39 PM, David Hildenbrand wrote:
[...]
> > Just to make sure, you ignored my comment regarding alignment
> > although I explicitly mentioned it a second time? Thanks.
> 
> I had explicitly asked Michal what should be included in the respin. Anyway,
> it seems like the previous thread is active again. I am happy to incorporate
> anything new that gets agreed on there.

Your patch is using the same alignment as the original code did. If
an explicit alignment is needed then it can be added on top, right?
David Hildenbrand Oct. 16, 2019, 12:29 p.m. UTC | #4
On 16.10.19 13:51, Michal Hocko wrote:
> On Wed 16-10-19 16:43:57, Anshuman Khandual wrote:
>>
>>
>> On 10/16/2019 04:39 PM, David Hildenbrand wrote:
> [...]
>>> Just to make sure, you ignored my comment regarding alignment
>>> although I explicitly mentioned it a second time? Thanks.
>>
>> I had explicitly asked Michal what should be included in the respin. Anyway,
>> it seems like the previous thread is active again. I am happy to incorporate
>> anything new that gets agreed on there.
> 
> Your patch is using the same alignment as the original code did. If
> an explicit alignment is needed then it can be added on top, right?
> 

Again, the "issue" I see here is that we could now pass in numbers that 
are not a power of two. For gigantic pages it was clear that we always 
have a number of two. The alignment does not make any sense otherwise.

What I'm asking for is

a) Document "The resulting PFN is aligned to nr_pages" and "nr_pages 
should be a power of two".

b) Eventually adding something like

if (WARN_ON_ONCE(!is_power_of_2(nr_pages)))
	return NULL;

Once we need other granularities we can think about the alignment and an 
optimized search process.
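
(For context, a hedged sketch of where such a guard could sit, assuming
is_power_of_2() from <linux/log2.h>; the rest of the function body is
elided:)

struct page *alloc_contig_pages(unsigned long nr_pages, gfp_t gfp_mask,
                                int nid, nodemask_t *nodemask)
{
        /* Hypothetical guard per (b): reject sizes for which the
         * nr_pages based alignment below would be meaningless. */
        if (WARN_ON_ONCE(!is_power_of_2(nr_pages)))
                return NULL;
        /* ... zone scan as in the patch ... */
}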
Michal Hocko Oct. 16, 2019, 12:41 p.m. UTC | #5
On Wed 16-10-19 14:29:05, David Hildenbrand wrote:
> On 16.10.19 13:51, Michal Hocko wrote:
> > On Wed 16-10-19 16:43:57, Anshuman Khandual wrote:
> > > 
> > > 
> > > On 10/16/2019 04:39 PM, David Hildenbrand wrote:
> > [...]
> > > > Just to make sure, you ignored my comment regarding alignment
> > > > although I explicitly mentioned it a second time? Thanks.
> > > 
> > > I had explicitly asked Michal what should be included in the respin. Anyway,
> > > it seems like the previous thread is active again. I am happy to incorporate
> > > anything new that gets agreed on there.
> > 
> > Your patch is using the same alignment as the original code did. If
> > an explicit alignment is needed then it can be added on top, right?
> > 
> 
> Again, the "issue" I see here is that we could now pass in numbers that are
> not a power of two. For gigantic pages it was clear that we always have a
> number of two. The alignment does not make any sense otherwise.
> 
> What I'm asking for is
> 
> a) Document "The resulting PFN is aligned to nr_pages" and "nr_pages should
> be a power of two".

OK, this makes sense.

> b) Eventually adding something like
> 
> if (WARN_ON_ONCE(!is_power_of_2(nr_pages)))
> 	return NULL;

I am not sure this is really needed.
Anshuman Khandual Oct. 16, 2019, 3:31 p.m. UTC | #6
On 10/16/2019 06:11 PM, Michal Hocko wrote:
> On Wed 16-10-19 14:29:05, David Hildenbrand wrote:
>> On 16.10.19 13:51, Michal Hocko wrote:
>>> [...]
>>>
>>> Your patch is using the same alignment as the original code did. If
>>> an explicit alignment is needed then it can be added on top, right?
>>>
>>
>> Again, the "issue" I see here is that we could now pass in numbers that are
>> not a power of two. For gigantic pages it was clear that we always have a
>> number of two. The alignment does not make any sense otherwise.

ALIGN() does expect nr_pages to be a power of two; otherwise the mask
value might not be correct, affecting the start pfn value for a zone.

#define ALIGN(x, a)             	__ALIGN_KERNEL((x), (a))
#define __ALIGN_KERNEL(x, a)            __ALIGN_KERNEL_MASK(x, (typeof(x))(a) - 1)
#define __ALIGN_KERNEL_MASK(x, mask)    (((x) + (mask)) & ~(mask))
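
To make the failure mode concrete, here is a small userspace demonstration
(the macros are copied from above; the numbers are made up):

#include <stdio.h>

#define __ALIGN_KERNEL_MASK(x, mask)    (((x) + (mask)) & ~(mask))
#define __ALIGN_KERNEL(x, a)            __ALIGN_KERNEL_MASK(x, (typeof(x))(a) - 1)
#define ALIGN(x, a)                     __ALIGN_KERNEL((x), (a))

int main(void)
{
        /* Power of two: 100 is rounded up to 128, a multiple of 32. */
        printf("%lu\n", ALIGN(100UL, 32));      /* prints 128 */
        /* Not a power of two: mask 23 (0b10111) has holes, so 100 gets
         * "rounded" to 104, which is not a multiple of 24 at all. */
        printf("%lu\n", ALIGN(100UL, 24));      /* prints 104 */
        return 0;
}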

>>
>> What I'm asking for is
>>
>> a) Document "The resulting PFN is aligned to nr_pages" and "nr_pages should
>> be a power of two".
> 
> OK, this makes sense.
Sure, will add this to the alloc_contig_pages() helper description and
in the commit message as well.

> 
>> b) Eventually adding something like
>>
>> if (WARN_ON_ONCE(!is_power_of_2(nr_pages)))
>> 	return NULL;
> 
> I am not sure this is really needed.
> 
Just wondering why not? If we are documenting that nr_pages should be a
power of two, then ideally we should abort an erring allocation request with
a warning. Or is it because the allocation can still succeed for
non-power-of-two requests despite the possible problem with mask and
alignment, and hence there is no need to abort?
Michal Hocko Oct. 16, 2019, 3:45 p.m. UTC | #7
On Wed 16-10-19 21:01:19, Anshuman Khandual wrote:
> [...]
> 
> ALIGN() does expect nr_pages to be a power of two; otherwise the mask
> value might not be correct, affecting the start pfn value for a zone.
> 
> #define ALIGN(x, a)             	__ALIGN_KERNEL((x), (a))
> #define __ALIGN_KERNEL(x, a)            __ALIGN_KERNEL_MASK(x, (typeof(x))(a) - 1)
> #define __ALIGN_KERNEL_MASK(x, mask)    (((x) + (mask)) & ~(mask))

Yes it doesn't really provide a sensible result but does it matter?
 
> >> What I'm asking for is
> >>
> >> a) Document "The resulting PFN is aligned to nr_pages" and "nr_pages should
> >> be a power of two".
> > 
> > OK, this makes sense.
> Sure, will add this to the alloc_contig_pages() helper description and
> in the commit message as well.
> 
> > 
> >> b) Eventually adding something like
> >>
> >> if (WARN_ON_ONCE(!is_power_of_2(nr_pages)))
> >> 	return NULL;
> > 
> > I am not sure this is really needed.
> > 
> Just wondering why not? If we are documenting that nr_pages should be a
> power of two, then ideally we should abort an erring allocation request with
> a warning. Or is it because the allocation can still succeed for
> non-power-of-two requests despite the possible problem with mask and
> alignment, and hence there is no need to abort?

What would we do about the warning? There are only three ways to go
about this: a) require a power-of-two nr_pages and always align to that, b)
do not restrict nr_pages and align implicitly to what makes sense (a power
of two for suitable sizes, plain page alignment otherwise) and c) allow
an explicit alignment.

The simplest way forward is b) because that doesn't really require any
code changes. And I thought the main point of this patch was to provide
something as simple as possible. a) would be only slightly more
complicated, but I am wondering why we should restrict the size of the
allocation when there is nothing inherent requiring that.
David Hildenbrand Oct. 16, 2019, 4:48 p.m. UTC | #8
On 16.10.19 17:31, Anshuman Khandual wrote:
> 
> 
> On 10/16/2019 06:11 PM, Michal Hocko wrote:
>> [...]
> 
> ALIGN() does expect nr_pages to be a power of two; otherwise the mask
> value might not be correct, affecting the start pfn value for a zone.
> 
> #define ALIGN(x, a)             	__ALIGN_KERNEL((x), (a))
> #define __ALIGN_KERNEL(x, a)            __ALIGN_KERNEL_MASK(x, (typeof(x))(a) - 1)
> #define __ALIGN_KERNEL_MASK(x, mask)    (((x) + (mask)) & ~(mask))
> 
>>>
>>> What I'm asking for is
>>>
>>> a) Document "The resulting PFN is aligned to nr_pages" and "nr_pages should
>>> be a power of two".
>>
>> OK, this makes sense.
> Sure, will add this to the alloc_contig_pages() helper description and
> in the commit message as well.

As long as it is documented that implicit alignment will happen, fine 
with me.

The thing about !is_power_of_2() is that we usually don't need an 
alignment there (or we would need an explicit one instead). And as I 
mentioned, the current function can easily fail to allocate a suitable 
range due to the way the search works (== it checks aligned blocks only). 
The search really only provides reliable results when size == alignment 
and it's a power of two, IMHO. Not documenting that is in my opinion 
misleading; somebody who wants !is_power_of_2() and has no alignment 
requirements should probably rework the function first.
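
(A toy model of that stepping, with made-up pfn numbers, to show how
candidate ranges get skipped:)

#include <stdio.h>

int main(void)
{
        /* Mimics the scan in alloc_contig_pages(): the candidate start
         * pfn advances in nr_pages sized steps from the zone start. */
        unsigned long zone_start = 0, zone_end = 100, nr_pages = 24;
        unsigned long pfn;

        for (pfn = zone_start; pfn + nr_pages <= zone_end; pfn += nr_pages)
                printf("checking [%lu, %lu)\n", pfn, pfn + nr_pages);
        /* Prints [0,24) [24,48) [48,72) [72,96). If the only free
         * 24-page range starts at pfn 30, it is never even examined, so
         * the allocation fails despite enough contiguous free memory. */
        return 0;
}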

So with some documentation regarding that

Acked-by: David Hildenbrand <david@redhat.com>
Mike Kravetz Oct. 17, 2019, 12:50 a.m. UTC | #9
On 10/16/19 4:02 AM, Anshuman Khandual wrote:
> HugeTLB helper alloc_gigantic_page() implements a fairly generic allocation
> method: it scans over various zones looking for a large contiguous pfn range
> before trying to allocate it with alloc_contig_range(). Other than deriving
> the requested order from 'struct hstate', there is nothing HugeTLB specific
> in there. This can be made available for general use, to allocate contiguous
> memory which could not be allocated through the buddy allocator.
> 
> alloc_gigantic_page() has been split up, carving out the actual allocation
> method, which is now made available via the new alloc_contig_pages() helper
> wrapped under CONFIG_CONTIG_ALLOC. All references to 'gigantic' have been
> replaced with the more generic term 'contig'. Pages allocated here should be
> freed with free_contig_range() or by calling __free_page() on each allocated
> page.

I had a 'test harness' from when I previously worked on such an interface.
It simply provides a user interface to call the allocator with random
values for nr_pages.  Several tasks are then started doing random allocations
in parallel.  The new interface is pretty straightforward, but the idea
was to stress the underlying code.  In fact, it did identify issues with
isolation which were corrected.

I exercised this new interface in the same way and am happy to report that
no issues were detected.
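
(Mike's harness itself was not posted; below is a minimal sketch of the idea,
written as a built-in kernel thread since alloc_contig_pages() is not
exported to modules. All names here are hypothetical.)

#include <linux/kthread.h>
#include <linux/gfp.h>
#include <linux/random.h>
#include <linux/delay.h>
#include <linux/topology.h>

static int contig_stress(void *unused)
{
        while (!kthread_should_stop()) {
                /* Random power-of-two request, between 2 and 1024 pages. */
                unsigned long nr = 1UL << (prandom_u32() % 10 + 1);
                struct page *page;

                page = alloc_contig_pages(nr, GFP_KERNEL, numa_mem_id(),
                                          NULL);
                if (page)
                        free_contig_range(page_to_pfn(page), nr);
                msleep(10);
        }
        return 0;
}

/* Several such threads can be run in parallel, e.g.:
 *      kthread_run(contig_stress, NULL, "contig_stress/%d", i);
 */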
Michal Hocko Oct. 17, 2019, 7:11 a.m. UTC | #10
On Thu 17-10-19 10:44:41, Anshuman Khandual wrote:
[...]
> Does this add-on documentation look okay? Should we also mention the
> possible reduction in the chances of success during the pfn block search
> for the non-power-of-two cases, as the implicit alignment will probably
> turn out to be bigger than nr_pages itself?
> 
>  * Requested nr_pages may or may not be a power of two. The search for suitable
>  * memory range in a zone happens in nr_pages aligned pfn blocks. But in case
>  * when nr_pages is not a power of two, an implicitly aligned pfn block search
>  * will happen which in turn will impact allocated memory block's alignment.
>  * In these cases, the size (i.e nr_pages) and the alignment of the allocated
>  * memory will be different. This problem does not exist when nr_pages is a power
>  * of two where the size and the alignment of the allocated memory will always
>  * be nr_pages.

I dunno, it sounds more complicated than really necessary IMHO. Callers
shouldn't really be bothered by memory blocks and other really deep
implementation details. Wouldn't the below be sufficient?

The allocated memory is always aligned to a page boundary. If nr_pages
is a power of two then the alignement is guaranteed to be to the given
nr_pages (e.g. 1GB request would be aligned to 1GB).
David Hildenbrand Oct. 17, 2019, 7:21 a.m. UTC | #11
On 17.10.19 09:11, Michal Hocko wrote:
> On Thu 17-10-19 10:44:41, Anshuman Khandual wrote:
> [...]
> 
> I dunno, it sounds more complicated than really necessary IMHO. Callers
> shouldn't really be bothered by memory blocks and other really deep
> implementation details. Wouldn't the below be sufficient?
> 
> The allocated memory is always aligned to a page boundary. If nr_pages
> is a power of two then the alignement is guaranteed to be to the given

s/alignement/alignment/

and "the PFN is guaranteed to be aligned to nr_pages" (the address is 
aligned to nr_pages*PAGE_SIZE)

> nr_pages (e.g. 1GB request would be aligned to 1GB).
> 

I'd probably add "This function will miss allocation opportunities if 
nr_pages is not a power of two (and the implicit alignment is bogus)."
Michal Hocko Oct. 17, 2019, 7:34 a.m. UTC | #12
On Thu 17-10-19 09:21:24, David Hildenbrand wrote:
> On 17.10.19 09:11, Michal Hocko wrote:
> > On Thu 17-10-19 10:44:41, Anshuman Khandual wrote:
> > [...]
> > 
> > I dunno, it sounds more complicated than really necessary IMHO. Callers
> > shouldn't really be bothered by memory blocks and other really deep
> > implementation details. Wouldn't the below be sufficient?
> > 
> > The allocated memory is always aligned to a page boundary. If nr_pages
> > is a power of two then the alignement is guaranteed to be to the given
> 
> s/alignement/alignment/
> 
> and "the PFN is guaranteed to be aligned to nr_pages" (the address is
> aligned to nr_pages*PAGE_SIZE)

thx for the correction.

> > nr_pages (e.g. 1GB request would be aligned to 1GB).
> > 
> 
> I'd probably add "This function will miss allocation opportunities if
> nr_pages is not a power of two (and the implicit alignment is bogus)."

This is again an implementation detail, and quite a confusing one to
anyone not familiar with the MM internals. And to be fair, even a proper
alignment doesn't give you any stronger guarantee as long as the
allocation operates on non-movable zones anyway.
David Hildenbrand Oct. 17, 2019, 7:38 a.m. UTC | #13
On 17.10.19 09:34, Michal Hocko wrote:
> On Thu 17-10-19 09:21:24, David Hildenbrand wrote:
>> On 17.10.19 09:11, Michal Hocko wrote:
>>> On Thu 17-10-19 10:44:41, Anshuman Khandual wrote:
>>> [...]
>>>
>>> I dunno, it sounds more complicated than really necessary IMHO. Callers
>>> shouldn't really be bothered by memory blocks and other really deep
>>> implementation details. Wouldn't the below be sufficient?
>>>
>>> The allocated memory is always aligned to a page boundary. If nr_pages
>>> is a power of two then the alignement is guaranteed to be to the given
>>
>> s/alignement/alignment/
>>
>> and "the PFN is guaranteed to be aligned to nr_pages" (the address is
>> aligned to nr_pages*PAGE_SIZE)
> 
> thx for the correction.
> 
>>> nr_pages (e.g. 1GB request would be aligned to 1GB).
>>>
>>
>> I'd probably add "This function will miss allocation opportunities if
>> nr_pages is not a power of two (and the implicit alignment is bogus)."
> 
> This is again an implementation detail, and quite a confusing one to
> anyone not familiar with the MM internals. And to be fair, even a proper
> alignment doesn't give you any stronger guarantee as long as the
> allocation operates on non-movable zones anyway.
> 

To be honest, I would not suggest that anyone use this function with an 
nr_pages that is not a power of two, and I already explained why. I prefer 
to spell that out rather than have people complain afterwards. Yes, it's an 
implementation detail users should be aware of until the function is 
reworked.

But I think we have talked about this here for way too long, so I am fine 
with either.

Patch

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index fb07b503dc45..1a11d4857027 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -589,6 +589,8 @@  static inline bool pm_suspended_storage(void)
 /* The below functions must be run on a range from a single zone. */
 extern int alloc_contig_range(unsigned long start, unsigned long end,
 			      unsigned migratetype, gfp_t gfp_mask);
+extern struct page *alloc_contig_pages(unsigned long nr_pages, gfp_t gfp_mask,
+				       int nid, nodemask_t *nodemask);
 #endif
 void free_contig_range(unsigned long pfn, unsigned int nr_pages);
 
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 985ee15eb04b..a5c2c880af27 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1023,85 +1023,12 @@  static void free_gigantic_page(struct page *page, unsigned int order)
 }
 
 #ifdef CONFIG_CONTIG_ALLOC
-static int __alloc_gigantic_page(unsigned long start_pfn,
-				unsigned long nr_pages, gfp_t gfp_mask)
-{
-	unsigned long end_pfn = start_pfn + nr_pages;
-	return alloc_contig_range(start_pfn, end_pfn, MIGRATE_MOVABLE,
-				  gfp_mask);
-}
-
-static bool pfn_range_valid_gigantic(struct zone *z,
-			unsigned long start_pfn, unsigned long nr_pages)
-{
-	unsigned long i, end_pfn = start_pfn + nr_pages;
-	struct page *page;
-
-	for (i = start_pfn; i < end_pfn; i++) {
-		page = pfn_to_online_page(i);
-		if (!page)
-			return false;
-
-		if (page_zone(page) != z)
-			return false;
-
-		if (PageReserved(page))
-			return false;
-
-		if (page_count(page) > 0)
-			return false;
-
-		if (PageHuge(page))
-			return false;
-	}
-
-	return true;
-}
-
-static bool zone_spans_last_pfn(const struct zone *zone,
-			unsigned long start_pfn, unsigned long nr_pages)
-{
-	unsigned long last_pfn = start_pfn + nr_pages - 1;
-	return zone_spans_pfn(zone, last_pfn);
-}
-
 static struct page *alloc_gigantic_page(struct hstate *h, gfp_t gfp_mask,
 		int nid, nodemask_t *nodemask)
 {
-	unsigned int order = huge_page_order(h);
-	unsigned long nr_pages = 1 << order;
-	unsigned long ret, pfn, flags;
-	struct zonelist *zonelist;
-	struct zone *zone;
-	struct zoneref *z;
-
-	zonelist = node_zonelist(nid, gfp_mask);
-	for_each_zone_zonelist_nodemask(zone, z, zonelist, gfp_zone(gfp_mask), nodemask) {
-		spin_lock_irqsave(&zone->lock, flags);
+	unsigned long nr_pages = 1UL << huge_page_order(h);
 
-		pfn = ALIGN(zone->zone_start_pfn, nr_pages);
-		while (zone_spans_last_pfn(zone, pfn, nr_pages)) {
-			if (pfn_range_valid_gigantic(zone, pfn, nr_pages)) {
-				/*
-				 * We release the zone lock here because
-				 * alloc_contig_range() will also lock the zone
-				 * at some point. If there's an allocation
-				 * spinning on this lock, it may win the race
-				 * and cause alloc_contig_range() to fail...
-				 */
-				spin_unlock_irqrestore(&zone->lock, flags);
-				ret = __alloc_gigantic_page(pfn, nr_pages, gfp_mask);
-				if (!ret)
-					return pfn_to_page(pfn);
-				spin_lock_irqsave(&zone->lock, flags);
-			}
-			pfn += nr_pages;
-		}
-
-		spin_unlock_irqrestore(&zone->lock, flags);
-	}
-
-	return NULL;
+	return alloc_contig_pages(nr_pages, gfp_mask, nid, nodemask);
 }
 
 static void prep_new_huge_page(struct hstate *h, struct page *page, int nid);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index cd1dd0712624..880bcbc5380d 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -8499,6 +8499,103 @@  int alloc_contig_range(unsigned long start, unsigned long end,
 				pfn_max_align_up(end), migratetype);
 	return ret;
 }
+
+static int __alloc_contig_pages(unsigned long start_pfn,
+				unsigned long nr_pages, gfp_t gfp_mask)
+{
+	unsigned long end_pfn = start_pfn + nr_pages;
+
+	return alloc_contig_range(start_pfn, end_pfn, MIGRATE_MOVABLE,
+				  gfp_mask);
+}
+
+static bool pfn_range_valid_contig(struct zone *z, unsigned long start_pfn,
+				   unsigned long nr_pages)
+{
+	unsigned long i, end_pfn = start_pfn + nr_pages;
+	struct page *page;
+
+	for (i = start_pfn; i < end_pfn; i++) {
+		page = pfn_to_online_page(i);
+		if (!page)
+			return false;
+
+		if (page_zone(page) != z)
+			return false;
+
+		if (PageReserved(page))
+			return false;
+
+		if (page_count(page) > 0)
+			return false;
+
+		if (PageHuge(page))
+			return false;
+	}
+	return true;
+}
+
+static bool zone_spans_last_pfn(const struct zone *zone,
+				unsigned long start_pfn, unsigned long nr_pages)
+{
+	unsigned long last_pfn = start_pfn + nr_pages - 1;
+
+	return zone_spans_pfn(zone, last_pfn);
+}
+
+/**
+ * alloc_contig_pages() -- tries to find and allocate contiguous range of pages
+ * @nr_pages:	Number of contiguous pages to allocate
+ * @gfp_mask:	GFP mask to limit search and used during compaction
+ * @nid:	Target node
+ * @nodemask:	Mask for other possible nodes
+ *
+ * This routine is a wrapper around alloc_contig_range(). It scans over zones
+ * on an applicable zonelist to find a contiguous pfn range which can then be
+ * tried for allocation with alloc_contig_range(). This routine is intended
+ * for allocation requests which can not be fulfilled with the buddy allocator.
+ *
+ * Allocated pages can be freed with free_contig_range() or by manually calling
+ * __free_page() on each allocated page.
+ *
+ * Return: pointer to contiguous pages on success, or NULL if not successful.
+ */
+struct page *alloc_contig_pages(unsigned long nr_pages, gfp_t gfp_mask,
+				int nid, nodemask_t *nodemask)
+{
+	unsigned long ret, pfn, flags;
+	struct zonelist *zonelist;
+	struct zone *zone;
+	struct zoneref *z;
+
+	zonelist = node_zonelist(nid, gfp_mask);
+	for_each_zone_zonelist_nodemask(zone, z, zonelist,
+					gfp_zone(gfp_mask), nodemask) {
+		spin_lock_irqsave(&zone->lock, flags);
+
+		pfn = ALIGN(zone->zone_start_pfn, nr_pages);
+		while (zone_spans_last_pfn(zone, pfn, nr_pages)) {
+			if (pfn_range_valid_contig(zone, pfn, nr_pages)) {
+				/*
+				 * We release the zone lock here because
+				 * alloc_contig_range() will also lock the zone
+				 * at some point. If there's an allocation
+				 * spinning on this lock, it may win the race
+				 * and cause alloc_contig_range() to fail...
+				 */
+				spin_unlock_irqrestore(&zone->lock, flags);
+				ret = __alloc_contig_pages(pfn, nr_pages,
+							gfp_mask);
+				if (!ret)
+					return pfn_to_page(pfn);
+				spin_lock_irqsave(&zone->lock, flags);
+			}
+			pfn += nr_pages;
+		}
+		spin_unlock_irqrestore(&zone->lock, flags);
+	}
+	return NULL;
+}
 #endif /* CONFIG_CONTIG_ALLOC */
 
 void free_contig_range(unsigned long pfn, unsigned int nr_pages)