Message ID | 20210208103812.32056-3-osalvador@suse.de (mailing list archive)
---|---
State | New, archived
Series | Make alloc_contig_range handle Hugetlb pages
On 08.02.21 11:38, Oscar Salvador wrote:
> Free hugetlb pages are trickier to handle as to in order to guarantee no userspace appplication disruption, we need to replace the current free hugepage with a new one.
>
> In order to do that, a new function called alloc_and_dissolve_huge_page in introduced. This function will first try to get a new fresh hugetlb page, and if it succeeds, it will dissolve the old one.
>

Thanks for looking into this! Can we move this patch to #1 in the series? It is the easier case.

I also wonder if we should at least try on the memory unplug path to keep nr_pages by at least trying to allocate at new one if required, and printing a warning if that fails (after all, we're messing with something configured by the admin - "nr_pages"). Note that gigantic pages are special (below).

> Signed-off-by: Oscar Salvador <osalvador@suse.de>
> ---
>  include/linux/hugetlb.h |  6 ++++++
>  mm/compaction.c         | 11 +++++++++++
>  mm/hugetlb.c            | 35 +++++++++++++++++++++++++++++++++++
>  3 files changed, 52 insertions(+)
>
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index ebca2ef02212..f81afcb86e89 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -505,6 +505,7 @@ struct huge_bootmem_page {
>  	struct hstate *hstate;
>  };
>
> +bool alloc_and_dissolve_huge_page(struct page *page);
>  struct page *alloc_huge_page(struct vm_area_struct *vma,
>  				unsigned long addr, int avoid_reserve);
>  struct page *alloc_huge_page_nodemask(struct hstate *h, int preferred_nid,
> @@ -773,6 +774,11 @@ static inline void huge_ptep_modify_prot_commit(struct vm_area_struct *vma,
>  #else /* CONFIG_HUGETLB_PAGE */
>  struct hstate {};
>
> +static inline bool alloc_and_dissolve_huge_page(struct page *page)
> +{
> +	return false;
> +}
> +
>  static inline struct page *alloc_huge_page(struct vm_area_struct *vma,
>  					   unsigned long addr,
>  					   int avoid_reserve)
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 89cd2e60da29..7969ddc10856 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -952,6 +952,17 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
>  				low_pfn += compound_nr(page) - 1;
>  				goto isolate_success_no_list;
>  			}
> +		} else {

} else if (alloc_and_dissolve_huge_page(page))) {

...

> +			/*
> +			 * Free hugetlb page. Allocate a new one and
> +			 * dissolve this is if succeed.
> +			 */
> +			if (alloc_and_dissolve_huge_page(page)) {
> +				unsigned long order = buddy_order_unsafe(page);
> +
> +				low_pfn += (1UL << order) - 1;
> +				continue;
> +			}

Note that there is a very ugly corner case we will have to handle gracefully (I think also in patch #1):

Assume you allocated a gigantic page (and assume that we are not using CMA for gigantic pages for simplicity). Assume you want to allocate another one. alloc_pool_huge_page()->...->alloc_contig_pages() will stumble over the first allocated page. It will try to alloc_and_dissolve_huge_page() the existing gigantic page. To do that, it will alloc_pool_huge_page()->...->alloc_contig_pages() ... and so on. Bad.

We really don't want to mess with gigantic pages (migrate, dissolve) while allocating a gigantic page. I think the easiest (and cleanest) way forward is to not mess (isolate, migrate, dissolve) with gigantic pages at all.

Gigantic pages are not movable, so they won't be placed on random CMA / ZONE_MOVABLE.

Some hstate_is_gigantic(h) calls (maybe inside alloc_and_dissolve_huge_page() ? ) along with a nice comment might be good enough to avoid having to pass down some kind of alloc_contig context. I even think that should be handled inside

(the main issue is that in contrast to CMA, plain alloc_contig_pages() has no memory about which parts were allocated and will simply try re-allocating what it previously allocated and never freed - which is usually fine, unless we're dealing with such special cases)

Apart from that, not messing with gigantic pages feels like the right approach (allocating/migrating gigantic pages is just horribly slow and most probably not worth it anyway).
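For illustration, the kind of early bail-out being suggested could look roughly like the sketch below. This is not the posted patch: the helper name is invented, and the unlocked page_hstate() lookup glosses over the hugetlb_lock handling the real code would need.

static bool hugetlb_page_is_gigantic(struct page *page)
{
	struct hstate *h;

	if (!PageHuge(page))
		return false;

	/* Sketch only: the real code would do this lookup under hugetlb_lock. */
	h = page_hstate(compound_head(page));
	return hstate_is_gigantic(h);
}

alloc_and_dissolve_huge_page() could then refuse to touch gigantic pages right at the top, which breaks the recursion into alloc_contig_pages() described above:

	if (hugetlb_page_is_gigantic(page))
		return false;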
On Wed, Feb 10, 2021 at 09:23:59AM +0100, David Hildenbrand wrote:
> On 08.02.21 11:38, Oscar Salvador wrote:
> > Free hugetlb pages are trickier to handle as to in order to guarantee no userspace appplication disruption, we need to replace the current free hugepage with a new one.
> >
> > In order to do that, a new function called alloc_and_dissolve_huge_page in introduced. This function will first try to get a new fresh hugetlb page, and if it succeeds, it will dissolve the old one.
> >
>
> Thanks for looking into this! Can we move this patch to #1 in the series? It is the easier case.
>
> I also wonder if we should at least try on the memory unplug path to keep nr_pages by at least trying to allocate at new one if required, and printing a warning if that fails (after all, we're messing with something configured by the admin - "nr_pages"). Note that gigantic pages are special (below).

So, do you mean to allocate a new fresh hugepage in case we have a free hugetlb page within the range we are trying to offline? That makes some sense I guess.

I can have a look at that, and make hotplug code use the new alloc_and_dissolve().

Thanks for bringing this up, it is something I did not think about.

> > +			/*
> > +			 * Free hugetlb page. Allocate a new one and
> > +			 * dissolve this is if succeed.
> > +			 */
> > +			if (alloc_and_dissolve_huge_page(page)) {
> > +				unsigned long order = buddy_order_unsafe(page);
> > +
> > +				low_pfn += (1UL << order) - 1;
> > +				continue;
> > +			}
> >
>
> Note that there is a very ugly corner case we will have to handle gracefully (I think also in patch #1):
>
> Assume you allocated a gigantic page (and assume that we are not using CMA for gigantic pages for simplicity). Assume you want to allocate another one. alloc_pool_huge_page()->...->alloc_contig_pages() will stumble over the first allocated page. It will try to alloc_and_dissolve_huge_page() the existing gigantic page. To do that, it will alloc_pool_huge_page()->...->alloc_contig_pages() ... and so on. Bad.

Heh, I was too naive. I have to confess I completely forgot about gigantic pages and this cyclic dependency.

> We really don't want to mess with gigantic pages (migrate, dissolve) while allocating a gigantic page. I think the easiest (and cleanest) way forward is to not mess (isolate, migrate, dissolve) with gigantic pages at all.
>
> Gigantic pages are not movable, so they won't be placed on random CMA / ZONE_MOVABLE.
>
> Some hstate_is_gigantic(h) calls (maybe inside alloc_and_dissolve_huge_page() ? ) along with a nice comment might be good enough to avoid having to pass down some kind of alloc_contig context. I even think that should be handled inside
>
> (the main issue is that in contrast to CMA, plain alloc_contig_pages() has no memory about which parts were allocated and will simply try re-allocating what it previously allocated and never freed - which is usually fine, unless we're dealing with such special cases)
>
> Apart from that, not messing with gigantic pages feels like the right approach (allocating/migrating gigantic pages is just horribly slow and most probably not worth it anyway).

Yes, I also agree that we should leave out gigantic pages, at least for now. We might make it work in the future but I cannot come up with a fancy way to work around that right now, so it makes sense to cut down the complexity here.

Thanks David for the insight!
On 10.02.21 15:24, Oscar Salvador wrote:
> On Wed, Feb 10, 2021 at 09:23:59AM +0100, David Hildenbrand wrote:
>> On 08.02.21 11:38, Oscar Salvador wrote:
>>> Free hugetlb pages are trickier to handle as to in order to guarantee no userspace appplication disruption, we need to replace the current free hugepage with a new one.
>>>
>>> In order to do that, a new function called alloc_and_dissolve_huge_page in introduced. This function will first try to get a new fresh hugetlb page, and if it succeeds, it will dissolve the old one.
>>>
>>
>> Thanks for looking into this! Can we move this patch to #1 in the series? It is the easier case.
>>
>> I also wonder if we should at least try on the memory unplug path to keep nr_pages by at least trying to allocate at new one if required, and printing a warning if that fails (after all, we're messing with something configured by the admin - "nr_pages"). Note that gigantic pages are special (below).
>
> So, do you mean to allocate a new fresh hugepage in case we have a free hugetlb page within the range we are trying to offline? That makes some sense I guess.
>
> I can have a look at that, and make hotplug code use the new alloc_and_dissolve().

Yes, with the difference that hotplug code most probably wants to continue even if allocation failed (printing a warning) - mimic existing behavior.

For alloc_contig, I'd say, fail if we cannot "relocate free huge pages that are still required to not modify nr_pages".

alloc_and_dissolve() should only allocate a page if really required (e.g., not sure if we could skip allocation in some cases - like with surplus pages, needs some investigation), such that the admin-configured nr_pages stays unchanged.
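As a rough sketch of what that could look like on the offline path (a hypothetical helper, not part of the posted series; error handling is simplified), the offlining code would first try to keep the admin-configured pool size stable, and only warn and shrink the pool when no replacement can be allocated:

static void offline_replace_free_huge_page(struct page *page)
{
	/*
	 * Try to keep nr_hugepages stable by allocating a replacement
	 * before dissolving this free page.
	 */
	if (alloc_and_dissolve_huge_page(page))
		return;

	/*
	 * Unlike alloc_contig_range(), offlining keeps going: warn and
	 * fall back to plain dissolving, shrinking the pool just like
	 * the existing dissolve_free_huge_pages() path does.
	 */
	pr_warn("hugetlb: no replacement page available, nr_hugepages will shrink\n");
	dissolve_free_huge_page(page);
}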
On 2/8/21 2:38 AM, Oscar Salvador wrote:
> Free hugetlb pages are trickier to handle as to in order to guarantee no userspace appplication disruption, we need to replace the current free hugepage with a new one.
>
> In order to do that, a new function called alloc_and_dissolve_huge_page in introduced. This function will first try to get a new fresh hugetlb page, and if it succeeds, it will dissolve the old one.
>
> Signed-off-by: Oscar Salvador <osalvador@suse.de>
> ---
>  include/linux/hugetlb.h |  6 ++++++
>  mm/compaction.c         | 11 +++++++++++
>  mm/hugetlb.c            | 35 +++++++++++++++++++++++++++++++++++
>  3 files changed, 52 insertions(+)
>
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index ebca2ef02212..f81afcb86e89 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -505,6 +505,7 @@ struct huge_bootmem_page {
>  	struct hstate *hstate;
>  };
>
> +bool alloc_and_dissolve_huge_page(struct page *page);
>  struct page *alloc_huge_page(struct vm_area_struct *vma,
>  				unsigned long addr, int avoid_reserve);
>  struct page *alloc_huge_page_nodemask(struct hstate *h, int preferred_nid,
> @@ -773,6 +774,11 @@ static inline void huge_ptep_modify_prot_commit(struct vm_area_struct *vma,
>  #else /* CONFIG_HUGETLB_PAGE */
>  struct hstate {};
>
> +static inline bool alloc_and_dissolve_huge_page(struct page *page)
> +{
> +	return false;
> +}
> +
>  static inline struct page *alloc_huge_page(struct vm_area_struct *vma,
>  					   unsigned long addr,
>  					   int avoid_reserve)
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 89cd2e60da29..7969ddc10856 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -952,6 +952,17 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
>  				low_pfn += compound_nr(page) - 1;
>  				goto isolate_success_no_list;
>  			}
> +		} else {
> +			/*
> +			 * Free hugetlb page. Allocate a new one and
> +			 * dissolve this is if succeed.
> +			 */
> +			if (alloc_and_dissolve_huge_page(page)) {
> +				unsigned long order = buddy_order_unsafe(page);
> +
> +				low_pfn += (1UL << order) - 1;
> +				continue;
> +			}
>  		}
>  		goto isolate_fail;
>  	}
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 18f6ee317900..79ffbb64c4ee 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -2253,6 +2253,41 @@ static void restore_reserve_on_error(struct hstate *h,
>  	}
>  }
>
> +bool alloc_and_dissolve_huge_page(struct page *page)
> +{
> +	NODEMASK_ALLOC(nodemask_t, nodes_allowed, GFP_KERNEL);
> +	struct page *head;
> +	struct hstate *h;
> +	bool ret = false;
> +	int nid;
> +
> +	if (!nodes_allowed)
> +		return ret;
> +
> +	spin_lock(&hugetlb_lock);
> +	head = compound_head(page);
> +	h = page_hstate(head);
> +	nid = page_to_nid(head);
> +	spin_unlock(&hugetlb_lock);
> +
> +	init_nodemask_of_node(nodes_allowed, nid);
> +
> +	/*
> +	 * Before dissolving the page, we need to allocate a new one,
> +	 * so the pool remains stable.
> +	 */
> +	if (alloc_pool_huge_page(h, nodes_allowed, NULL)) {
> +		/*
> +		 * Ok, we have a free hugetlb-page to replace this
> +		 * one. Dissolve the old page.
> +		 */
> +		if (!dissolve_free_huge_page(page))
> +			ret = true;

Should probably check for -EBUSY as this means someone started using the page while we were allocating a new one. It would complicate the code to try and do the 'right thing'. Right thing might be getting dissolving the new pool page and then trying to isolate this in use page. Of course things could change again while you are doing that. :(

Might be good enough as is. As noted previously, this code is racy.
On Wed, Feb 10, 2021 at 05:16:29PM -0800, Mike Kravetz wrote:
> Should probably check for -EBUSY as this means someone started using the page while we were allocating a new one. It would complicate the code to try and do the 'right thing'. Right thing might be getting dissolving the new pool page and then trying to isolate this in use page. Of course things could change again while you are doing that. :(

Yeah, I kept the error handling rather low just to be clear about the approach I was leaning towards, but yes, we should definitely check for -EBUSY on dissolve_free_huge_page().

And it might be that dissolve_free_huge_page() returns -EBUSY on the old page, and we need to dissolve the freshly allocated one as it is not going to be used, and that might fail as well due to reserves for instance, or maybe someone started using it? I have to confess that I need to check the reservation code closer to be aware of corner cases.

We used to try to be clever in such situations in memory-failure code, but at some point you end up thinking "ok, how many retries are considered enough?". That code was trickier as we were handling uncorrected/corrected memory errors, so we strived to do our best, but here we can be more flexible as the whole thing is racy per se, and just fail if we find too many obstacles.

I shall resume work early next week.

Thanks for the tips ;-)
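As an illustration of the error handling being discussed, the final hunk of alloc_and_dissolve_huge_page() (quoted above) could distinguish the -EBUSY case roughly like this. This is a sketch under the thread's assumptions, relying on the surrounding function's h, nodes_allowed, page and ret variables; it is not the code that was eventually merged:

	if (alloc_pool_huge_page(h, nodes_allowed, NULL)) {
		int rc = dissolve_free_huge_page(page);

		if (!rc) {
			ret = true;
		} else if (rc == -EBUSY) {
			/*
			 * Someone started using the old page while the
			 * replacement was being allocated. The fresh pool
			 * page could be dissolved again here, and the caller
			 * could retry by isolating the now in-use page -
			 * with a bounded number of retries, since the
			 * situation can keep changing underneath us.
			 */
		}
	}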
On 2/10/21 12:23 AM, David Hildenbrand wrote:
> On 08.02.21 11:38, Oscar Salvador wrote:
>> --- a/mm/compaction.c
>> +++ b/mm/compaction.c
>> @@ -952,6 +952,17 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
>>  				low_pfn += compound_nr(page) - 1;
>>  				goto isolate_success_no_list;
>>  			}
>> +		} else {
>
> } else if (alloc_and_dissolve_huge_page(page))) {
>
> ...
>
>> +			/*
>> +			 * Free hugetlb page. Allocate a new one and
>> +			 * dissolve this is if succeed.
>> +			 */
>> +			if (alloc_and_dissolve_huge_page(page)) {
>> +				unsigned long order = buddy_order_unsafe(page);
>> +
>> +				low_pfn += (1UL << order) - 1;
>> +				continue;
>> +			}
>
>
>
> Note that there is a very ugly corner case we will have to handle gracefully (I think also in patch #1):
>
> Assume you allocated a gigantic page (and assume that we are not using CMA for gigantic pages for simplicity). Assume you want to allocate another one. alloc_pool_huge_page()->...->alloc_contig_pages() will stumble over the first allocated page. It will try to alloc_and_dissolve_huge_page() the existing gigantic page. To do that, it will alloc_pool_huge_page()->...->alloc_contig_pages() ... and so on. Bad.
>

Sorry for resurrecting an old thread.

While looking at V3 of these patches, I was exploring all the calling sequences looking for races and other issues. It 'may' be that the issue about infinitely allocating and freeing gigantic pages may not be an issue. Of course, I could be mistaken. Here is my reasoning:

alloc_and_dissolve_huge_page (now isolate_or_dissolve_huge_page) will be called from __alloc_contig_migrate_range() within alloc_contig_range(). Before calling __alloc_contig_migrate_range, we call start_isolate_page_range to isolate all page blocks in the range. Because all the page blocks in the range are isolated, another invocation of alloc_contig_range will not operate on any part of that range. See the comments for start_isolate_page_range or commit 2c7452a075d4. So, when start_isolate_page_range goes to allocate another gigantic page it will never notice/operate on the existing gigantic page.

Again, this is confusing and I might be missing something.

In any case, I agree that gigantic pages are tricky and we should leave them out of the discussion for now. We can rethink this later if necessary.
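To make the argument easier to follow, the call sequence Mike describes looks roughly like this (a simplified sketch of the code paths at the time, not a verbatim trace):

alloc_pool_huge_page()                       /* allocating a new gigantic page */
  alloc_contig_pages()
    alloc_contig_range(start, end)
      start_isolate_page_range(start, end)   /* all pageblocks in the range are now isolated */
      __alloc_contig_migrate_range()
        isolate_or_dissolve_huge_page(page)  /* range holds an existing free gigantic page */
          alloc_pool_huge_page()             /* nested attempt to allocate a replacement */
            alloc_contig_pages()
              alloc_contig_range(...)        /* its start_isolate_page_range() returns -EBUSY
                                                for already-isolated pageblocks, so the nested
                                                call skips the outer range instead of recursing */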
> On 25.02.2021 at 22:43, Mike Kravetz <mike.kravetz@oracle.com> wrote:
>
> On 2/10/21 12:23 AM, David Hildenbrand wrote:
>>> On 08.02.21 11:38, Oscar Salvador wrote:
>>> --- a/mm/compaction.c
>>> +++ b/mm/compaction.c
>>> @@ -952,6 +952,17 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
>>>  				low_pfn += compound_nr(page) - 1;
>>>  				goto isolate_success_no_list;
>>>  			}
>>> +		} else {
>>
>> } else if (alloc_and_dissolve_huge_page(page))) {
>>
>> ...
>>
>>> +			/*
>>> +			 * Free hugetlb page. Allocate a new one and
>>> +			 * dissolve this is if succeed.
>>> +			 */
>>> +			if (alloc_and_dissolve_huge_page(page)) {
>>> +				unsigned long order = buddy_order_unsafe(page);
>>> +
>>> +				low_pfn += (1UL << order) - 1;
>>> +				continue;
>>> +			}
>>
>>
>>
>> Note that there is a very ugly corner case we will have to handle gracefully (I think also in patch #1):
>>
>> Assume you allocated a gigantic page (and assume that we are not using CMA for gigantic pages for simplicity). Assume you want to allocate another one. alloc_pool_huge_page()->...->alloc_contig_pages() will stumble over the first allocated page. It will try to alloc_and_dissolve_huge_page() the existing gigantic page. To do that, it will alloc_pool_huge_page()->...->alloc_contig_pages() ... and so on. Bad.
>>
>
> Sorry for resurrecting an old thread.
> While looking at V3 of these patches, I was exploring all the calling sequences looking for races and other issues. It 'may' be that the issue about infinitely allocating and freeing gigantic pages may not be an issue. Of course, I could be mistaken. Here is my reasoning:
>
> alloc_and_dissolve_huge_page (now isolate_or_dissolve_huge_page) will be called from __alloc_contig_migrate_range() within alloc_contig_range(). Before calling __alloc_contig_migrate_range, we call start_isolate_page_range to isolate all page blocks in the range. Because all the page blocks in the range are isolated, another invocation of alloc_contig_range will not operate on any part of that range. See the comments for start_isolate_page_range or commit 2c7452a075d4. So, when start_isolate_page_range goes to allocate another gigantic page it will never notice/operate on the existing gigantic page.
>
> Again, this is confusing and I might be missing something.

I think you are right that the endless loop is blocked. But I think the whole thing could cascade once we have multiple gigantic pages allocated.

Try allocating a new gpage. We find an existing gpage, isolate it and try to migrate it. To do that, we try allocating a new gpage. We find yet another existing gpage, isolate and try to migrate it ... until we isolated all gpages on our way to an actual usable area. Then we have to actually migrate all these in reverse order ...

Of course this only works if we can actually isolate a gigantic page - which should be the case I think (they are migratable and should be marked as movable).

>
> In any case, I agree that gigantic pages are tricky and we should leave them out of the discussion for now. We can rethink this later if necessary.

Yes, it's tricky and not strictly required right now because we never place them on ZONE_MOVABLE. And as I said, actual use cases might be rare.
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index ebca2ef02212..f81afcb86e89 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -505,6 +505,7 @@ struct huge_bootmem_page {
 	struct hstate *hstate;
 };
 
+bool alloc_and_dissolve_huge_page(struct page *page);
 struct page *alloc_huge_page(struct vm_area_struct *vma,
 				unsigned long addr, int avoid_reserve);
 struct page *alloc_huge_page_nodemask(struct hstate *h, int preferred_nid,
@@ -773,6 +774,11 @@ static inline void huge_ptep_modify_prot_commit(struct vm_area_struct *vma,
 #else /* CONFIG_HUGETLB_PAGE */
 struct hstate {};
 
+static inline bool alloc_and_dissolve_huge_page(struct page *page)
+{
+	return false;
+}
+
 static inline struct page *alloc_huge_page(struct vm_area_struct *vma,
 					   unsigned long addr,
 					   int avoid_reserve)
diff --git a/mm/compaction.c b/mm/compaction.c
index 89cd2e60da29..7969ddc10856 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -952,6 +952,17 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
 				low_pfn += compound_nr(page) - 1;
 				goto isolate_success_no_list;
 			}
+		} else {
+			/*
+			 * Free hugetlb page. Allocate a new one and
+			 * dissolve this is if succeed.
+			 */
+			if (alloc_and_dissolve_huge_page(page)) {
+				unsigned long order = buddy_order_unsafe(page);
+
+				low_pfn += (1UL << order) - 1;
+				continue;
+			}
 		}
 		goto isolate_fail;
 	}
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 18f6ee317900..79ffbb64c4ee 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2253,6 +2253,41 @@ static void restore_reserve_on_error(struct hstate *h,
 	}
 }
 
+bool alloc_and_dissolve_huge_page(struct page *page)
+{
+	NODEMASK_ALLOC(nodemask_t, nodes_allowed, GFP_KERNEL);
+	struct page *head;
+	struct hstate *h;
+	bool ret = false;
+	int nid;
+
+	if (!nodes_allowed)
+		return ret;
+
+	spin_lock(&hugetlb_lock);
+	head = compound_head(page);
+	h = page_hstate(head);
+	nid = page_to_nid(head);
+	spin_unlock(&hugetlb_lock);
+
+	init_nodemask_of_node(nodes_allowed, nid);
+
+	/*
+	 * Before dissolving the page, we need to allocate a new one,
+	 * so the pool remains stable.
+	 */
+	if (alloc_pool_huge_page(h, nodes_allowed, NULL)) {
+		/*
+		 * Ok, we have a free hugetlb-page to replace this
+		 * one. Dissolve the old page.
+		 */
+		if (!dissolve_free_huge_page(page))
+			ret = true;
+	}
+
+	return ret;
+}
+
 struct page *alloc_huge_page(struct vm_area_struct *vma,
 				    unsigned long addr, int avoid_reserve)
 {
Free hugetlb pages are trickier to handle because, in order to guarantee no userspace application disruption, we need to replace the current free hugepage with a new one.

In order to do that, a new function called alloc_and_dissolve_huge_page is introduced. This function will first try to get a new fresh hugetlb page, and if it succeeds, it will dissolve the old one.

Signed-off-by: Oscar Salvador <osalvador@suse.de>
---
 include/linux/hugetlb.h |  6 ++++++
 mm/compaction.c         | 11 +++++++++++
 mm/hugetlb.c            | 35 +++++++++++++++++++++++++++++++++++
 3 files changed, 52 insertions(+)