Message ID: 20210217184926.33567-1-mike.kravetz@oracle.com (mailing list archive)
State:      New, archived
Series:     [1/2] hugetlb: fix update_and_free_page contig page struct assumption
On Wed, 17 Feb 2021 10:49:25 -0800 Mike Kravetz <mike.kravetz@oracle.com> wrote:

> page structs are not guaranteed to be contiguous for gigantic pages. The
> routine update_and_free_page can encounter a gigantic page, yet it assumes
> page structs are contiguous when setting page flags in subpages.
>
> If update_and_free_page encounters non-contiguous page structs, we can
> see "BUG: Bad page state in process ..." errors.
>
> Non-contiguous page structs are generally not an issue. However, they can
> exist with a specific kernel configuration and hotplug operations. For
> example: Configure the kernel with CONFIG_SPARSEMEM and
> !CONFIG_SPARSEMEM_VMEMMAP. Then, hotplug add memory for the area where the
> gigantic page will be allocated. Zi Yan outlined steps to reproduce here [1].
>
> [1] https://lore.kernel.org/linux-mm/16F7C58B-4D79-41C5-9B64-A1A1628F4AF2@nvidia.com/
>
> Fixes: 944d9fec8d7a ("hugetlb: add support for gigantic page allocation at runtime")

June 2014. That's a long lurk time for a bug. I wonder if some later
commit revealed it.

I guess it doesn't matter a lot, but some -stable kernel maintainers
might wonder if they really need this fix...
On 2/17/21 11:02 AM, Andrew Morton wrote:
> On Wed, 17 Feb 2021 10:49:25 -0800 Mike Kravetz <mike.kravetz@oracle.com> wrote:
>
>> page structs are not guaranteed to be contiguous for gigantic pages. The
>> routine update_and_free_page can encounter a gigantic page, yet it assumes
>> page structs are contiguous when setting page flags in subpages.
[...]
>> Fixes: 944d9fec8d7a ("hugetlb: add support for gigantic page allocation at runtime")
>
> June 2014. That's a long lurk time for a bug. I wonder if some later
> commit revealed it.
>
> I guess it doesn't matter a lot, but some -stable kernel maintainers
> might wonder if they really need this fix...

I am not sure how common a CONFIG_SPARSEMEM and !CONFIG_SPARSEMEM_VMEMMAP
config is. On the more popular architectures, this is not the default. But,
you can build a kernel with such options. And, then you need to hotplug-add
memory and allocate a gigantic page there. It is unlikely to happen, but
possible since Zi could force the BUG.

The copy_huge_page_from_user bug requires the same non-normal configuration
and is just as unlikely to occur. But, since it can overwrite somewhat
random pages I would feel better if it was fixed.
On Wed, Feb 17, 2021 at 11:02:52AM -0800, Andrew Morton wrote:
> On Wed, 17 Feb 2021 10:49:25 -0800 Mike Kravetz <mike.kravetz@oracle.com> wrote:
> > page structs are not guaranteed to be contiguous for gigantic pages. The
>
> June 2014. That's a long lurk time for a bug. I wonder if some later
> commit revealed it.

I would suggest that gigantic pages have not seen much use. Certainly
performance with Intel CPUs on benchmarks that I've been involved with
showed lower performance with 1GB pages than with 2MB pages until quite
recently.
On Thu, Feb 18, 2021 at 02:45:54PM +0000, Matthew Wilcox wrote:
> On Wed, Feb 17, 2021 at 11:02:52AM -0800, Andrew Morton wrote:
> > June 2014. That's a long lurk time for a bug. I wonder if some later
> > commit revealed it.
>
> I would suggest that gigantic pages have not seen much use. Certainly
> performance with Intel CPUs on benchmarks that I've been involved with
> showed lower performance with 1GB pages than with 2MB pages until quite
> recently.

I suggested in another thread that maybe it is time to consider
dropping this "feature".

If it has been slightly broken for 7 years it seems a good bet it
isn't actually being used.

The cost to fix GUP to be compatible with this will hurt normal
GUP performance - and again, that nobody has hit this bug in GUP
further suggests the feature isn't used.

Jason
On 18 Feb 2021, at 12:25, Jason Gunthorpe wrote:
> On Thu, Feb 18, 2021 at 02:45:54PM +0000, Matthew Wilcox wrote:
>> I would suggest that gigantic pages have not seen much use. Certainly
>> performance with Intel CPUs on benchmarks that I've been involved with
>> showed lower performance with 1GB pages than with 2MB pages until quite
>> recently.
>
> I suggested in another thread that maybe it is time to consider
> dropping this "feature"

You mean dropping gigantic page support in hugetlb?

> If it has been slightly broken for 7 years it seems a good bet it
> isn't actually being used.
>
> The cost to fix GUP to be compatible with this will hurt normal
> GUP performance - and again, that nobody has hit this bug in GUP
> further suggests the feature isn't used..

An easy fix might be to make gigantic hugetlb pages depend on
CONFIG_SPARSEMEM_VMEMMAP, which guarantees all struct pages are contiguous.

--
Best Regards,
Yan Zi
On Thu, Feb 18, 2021 at 12:27:58PM -0500, Zi Yan wrote:
> On 18 Feb 2021, at 12:25, Jason Gunthorpe wrote:
>> I suggested in another thread that maybe it is time to consider
>> dropping this "feature"
>
> You mean dropping gigantic page support in hugetlb?

No, I mean dropping support for arches that want to do:

  tail_page != head_page + tail_page_nr

because they can't allocate the required page array either virtually
or physically contiguously.

It seems like quite a burden on the core mm for a very niche, and
maybe even non-existent, case.

It was originally done for PPC; can these PPC systems use VMEMMAP now?

>> The cost to fix GUP to be compatible with this will hurt normal
>> GUP performance - and again, that nobody has hit this bug in GUP
>> further suggests the feature isn't used..
>
> An easy fix might be to make gigantic hugetlb pages depend on
> CONFIG_SPARSEMEM_VMEMMAP, which guarantees all struct pages are contiguous.

Yes, exactly.

Jason
On 2/18/21 9:25 AM, Jason Gunthorpe wrote:
> I suggested in another thread that maybe it is time to consider
> dropping this "feature"
>
> If it has been slightly broken for 7 years it seems a good bet it
> isn't actually being used.
>
> The cost to fix GUP to be compatible with this will hurt normal
> GUP performance - and again, that nobody has hit this bug in GUP
> further suggests the feature isn't used..

I was thinking that we could detect these 'unusual' configurations and only
do the slower page struct walking in those cases. However, we would need to
do some research to make sure we have taken into account all possible config
options which can produce non-contiguous page structs. That should have zero
performance impact in the 'normal' cases.

I suppose we could prohibit gigantic pages in these 'unusual' configurations.
It would require some research to see if this 'may' impact someone.
On 18 Feb 2021, at 12:32, Jason Gunthorpe wrote:
> On Thu, Feb 18, 2021 at 12:27:58PM -0500, Zi Yan wrote:
>> You mean dropping gigantic page support in hugetlb?
>
> No, I mean dropping support for arches that want to do:
>
>   tail_page != head_page + tail_page_nr
>
> because they can't allocate the required page array either virtually
> or physically contiguously.
>
> It seems like quite a burden on the core mm for a very niche, and
> maybe even non-existent, case.
>
> It was originally done for PPC, can these PPC systems use VMEMMAP now?
>
>> An easy fix might be to make gigantic hugetlb pages depend on
>> CONFIG_SPARSEMEM_VMEMMAP, which guarantees all struct pages are contiguous.
>
> Yes, exactly.

I actually have a question on CONFIG_SPARSEMEM_VMEMMAP. Can we assume
PFN_A - PFN_B == struct_page_A - struct_page_B, meaning all struct pages
are ordered based on physical addresses? I just wonder, for two PFN ranges,
e.g., [0 - 128MB] and [128MB - 256MB], whether it is possible to first
online [128MB - 256MB], then [0 - 128MB], and end up with the struct pages
of [128MB - 256MB] in front of [0 - 128MB] in the vmemmap due to online
ordering.

--
Best Regards,
Yan Zi
On 2/18/21 9:40 AM, Zi Yan wrote:
> I actually have a question on CONFIG_SPARSEMEM_VMEMMAP. Can we assume
> PFN_A - PFN_B == struct_page_A - struct_page_B, meaning all struct pages
> are ordered based on physical addresses? I just wonder for two PFN ranges,
> e.g., [0 - 128MB], [128MB - 256MB], if it is possible to first online
> [128MB - 256MB] then [0 - 128MB] and the struct pages of [128MB - 256MB]
> are in front of [0 - 128MB] in the vmemmap due to online ordering.

I have not looked at the code which does the onlining and vmemmap setup.
But, these definitions make me believe it is true:

#elif defined(CONFIG_SPARSEMEM_VMEMMAP)

/* memmap is virtually contiguous.  */
#define __pfn_to_page(pfn)	(vmemmap + (pfn))
#define __page_to_pfn(page)	(unsigned long)((page) - vmemmap)
On 18 Feb 2021, at 12:51, Mike Kravetz wrote:
> On 2/18/21 9:40 AM, Zi Yan wrote:
>> I actually have a question on CONFIG_SPARSEMEM_VMEMMAP. Can we assume
>> PFN_A - PFN_B == struct_page_A - struct_page_B, meaning all struct pages
>> are ordered based on physical addresses?
>
> I have not looked at the code which does the onlining and vmemmap setup.
> But, these definitions make me believe it is true:
>
> #elif defined(CONFIG_SPARSEMEM_VMEMMAP)
>
> /* memmap is virtually contiguous.  */
> #define __pfn_to_page(pfn)	(vmemmap + (pfn))
> #define __page_to_pfn(page)	(unsigned long)((page) - vmemmap)

Makes sense. Thank you for checking.

I guess making gigantic pages depend on CONFIG_SPARSEMEM_VMEMMAP might be
a good way of simplifying the code and avoiding future bugs, unless there
is an arch that really needs gigantic pages and cannot have VMEMMAP.

--
Best Regards,
Yan Zi
On 2/18/21 9:34 AM, Mike Kravetz wrote:
> On 2/18/21 9:25 AM, Jason Gunthorpe wrote:
>> The cost to fix GUP to be compatible with this will hurt normal
>> GUP performance - and again, that nobody has hit this bug in GUP
>> further suggests the feature isn't used..
>
> I was thinking that we could detect these 'unusual' configurations and only
> do the slower page struct walking in those cases. However, we would need to
> do some research to make sure we have taken into account all possible config
> options which can produce non-contiguous page structs. That should have zero
> performance impact in the 'normal' cases.

What about something like the following patch, and making all code that
wants to scan gigantic page subpages use mem_map_next()?
From 95b0384bd5d7f0435546bdd3c01c478724ae0166 Mon Sep 17 00:00:00 2001 From: Mike Kravetz <mike.kravetz@oracle.com> Date: Thu, 18 Feb 2021 13:35:02 -0800 Subject: [PATCH] mm: define PFN_PAGE_MAP_LINEAR to optimize gigantic page scans Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> --- arch/ia64/include/asm/page.h | 1 + arch/m68k/include/asm/page_no.h | 1 + include/asm-generic/memory_model.h | 2 ++ mm/internal.h | 2 ++ 4 files changed, 6 insertions(+) diff --git a/arch/ia64/include/asm/page.h b/arch/ia64/include/asm/page.h index b69a5499d75b..8f4288862ec8 100644 --- a/arch/ia64/include/asm/page.h +++ b/arch/ia64/include/asm/page.h @@ -106,6 +106,7 @@ extern struct page *vmem_map; #ifdef CONFIG_DISCONTIGMEM # define page_to_pfn(page) ((unsigned long) (page - vmem_map)) # define pfn_to_page(pfn) (vmem_map + (pfn)) +# define PFN_PAGE_MAP_LINEAR # define __pfn_to_phys(pfn) PFN_PHYS(pfn) #else # include <asm-generic/memory_model.h> diff --git a/arch/m68k/include/asm/page_no.h b/arch/m68k/include/asm/page_no.h index 6bbe52025de3..cafc0731a42c 100644 --- a/arch/m68k/include/asm/page_no.h +++ b/arch/m68k/include/asm/page_no.h @@ -28,6 +28,7 @@ extern unsigned long memory_end; #define pfn_to_page(pfn) virt_to_page(pfn_to_virt(pfn)) #define page_to_pfn(page) virt_to_pfn(page_to_virt(page)) +#define PFN_PAGE_MAP_LINEAR #define pfn_valid(pfn) ((pfn) < max_mapnr) #define virt_addr_valid(kaddr) (((void *)(kaddr) >= (void *)PAGE_OFFSET) && \ diff --git a/include/asm-generic/memory_model.h b/include/asm-generic/memory_model.h index 7637fb46ba4f..8ac4c48dbf22 100644 --- a/include/asm-generic/memory_model.h +++ b/include/asm-generic/memory_model.h @@ -33,6 +33,7 @@ #define __pfn_to_page(pfn) (mem_map + ((pfn) - ARCH_PFN_OFFSET)) #define __page_to_pfn(page) ((unsigned long)((page) - mem_map) + \ ARCH_PFN_OFFSET) +#define PFN_PAGE_MAP_LINEAR #elif defined(CONFIG_DISCONTIGMEM) #define __pfn_to_page(pfn) \ @@ -53,6 +54,7 @@ /* memmap is virtually contiguous. 
*/ #define __pfn_to_page(pfn) (vmemmap + (pfn)) #define __page_to_pfn(page) (unsigned long)((page) - vmemmap) +#define PFN_PAGE_MAP_LINEAR #elif defined(CONFIG_SPARSEMEM) /* diff --git a/mm/internal.h b/mm/internal.h index 25d2b2439f19..64cc5069047c 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -454,12 +454,14 @@ static inline struct page *mem_map_offset(struct page *base, int offset) static inline struct page *mem_map_next(struct page *iter, struct page *base, int offset) { +#ifndef PFN_PAGE_MAP_LINEAR if (unlikely((offset & (MAX_ORDER_NR_PAGES - 1)) == 0)) { unsigned long pfn = page_to_pfn(base) + offset; if (!pfn_valid(pfn)) return NULL; return pfn_to_page(pfn); } +#endif return iter + 1; }
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 4bdb58ab14cb..94e9fa803294 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1312,14 +1312,16 @@ static inline void destroy_compound_gigantic_page(struct page *page,
 static void update_and_free_page(struct hstate *h, struct page *page)
 {
 	int i;
+	struct page *subpage = page;

 	if (hstate_is_gigantic(h) && !gigantic_page_runtime_supported())
 		return;

 	h->nr_huge_pages--;
 	h->nr_huge_pages_node[page_to_nid(page)]--;
-	for (i = 0; i < pages_per_huge_page(h); i++) {
-		page[i].flags &= ~(1 << PG_locked | 1 << PG_error |
+	for (i = 0; i < pages_per_huge_page(h);
+	     i++, subpage = mem_map_next(subpage, page, i)) {
+		subpage->flags &= ~(1 << PG_locked | 1 << PG_error |
 				1 << PG_referenced | 1 << PG_dirty |
 				1 << PG_active | 1 << PG_private |
 				1 << PG_writeback);