Message ID | 20180710184903.68239-1-cannonmatthews@google.com (mailing list archive) |
---|---|
State | New, archived |
On 07/10/2018 11:49 AM, Cannon Matthews wrote:
> When using 1GiB pages during early boot, use the new
> memblock_virt_alloc_try_nid_raw() function to allocate memory without
> zeroing it. Zeroing out hundreds or thousands of GiB in a single core
> memset() call is very slow, and can make early boot last upwards of
> 20-30 minutes on multi TiB machines.
>
> To be safe, still zero the first sizeof(struct huge_bootmem_page) bytes
> since this is used as a temporary storage place for this info until
> gather_bootmem_prealloc() processes them later.
>
> The rest of the memory does not need to be zero'd as the hugetlb pages
> are always zero'd on page fault.
>
> Tested: Booted with ~3800 1G pages, and it booted successfully in
> roughly the same amount of time as with 0, as opposed to the 25+
> minutes it would take before.
>

Nice improvement!

> Signed-off-by: Cannon Matthews <cannonmatthews@google.com>
> ---
>  mm/hugetlb.c | 7 ++++++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 3612fbb32e9d..c93a2c77e881 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -2101,7 +2101,7 @@ int __alloc_bootmem_huge_page(struct hstate *h)
>  	for_each_node_mask_to_alloc(h, nr_nodes, node, &node_states[N_MEMORY]) {
>  		void *addr;
>
> -		addr = memblock_virt_alloc_try_nid_nopanic(
> +		addr = memblock_virt_alloc_try_nid_raw(
>  				huge_page_size(h), huge_page_size(h),
>  				0, BOOTMEM_ALLOC_ACCESSIBLE, node);
>  		if (addr) {
> @@ -2109,7 +2109,12 @@ int __alloc_bootmem_huge_page(struct hstate *h)
>  			 * Use the beginning of the huge page to store the
>  			 * huge_bootmem_page struct (until gather_bootmem
>  			 * puts them into the mem_map).
> +			 *
> +			 * memblock_virt_alloc_try_nid_raw returns non-zero'd
> +			 * memory so zero out just enough for this struct, the
> +			 * rest will be zero'd on page fault.
>  			 */
> +			memset(addr, 0, sizeof(struct huge_bootmem_page));

This forced me to look at the usage of huge_bootmem_page. It is defined as:

struct huge_bootmem_page {
	struct list_head list;
	struct hstate *hstate;
#ifdef CONFIG_HIGHMEM
	phys_addr_t phys;
#endif
};

The list and hstate fields are set immediately after allocating the memory
block here and elsewhere. However, I can't find any code that sets phys,
although it is potentially used in gather_bootmem_prealloc(). It appears
powerpc used this field at one time, but no longer does.

Am I missing something?

Not an issue with this patch, rather existing code. I'd prefer not to do
the memset() "just to be safe". Unless I am missing something, I would
like to remove the phys field and supporting code first, and then take
this patch without the memset.
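For readers following the question about the unused phys field: the sketch below is a simplified paraphrase, not the verbatim mm/hugetlb.c code, of what the non-HIGHMEM path of gather_bootmem_prealloc() does with each entry. It only dereferences the list linkage and m->hstate, which is why just the first sizeof(struct huge_bootmem_page) bytes of each raw-allocated gigantic page need defined contents before it runs; the CONFIG_HIGHMEM branch that would consult phys is omitted.

```c
/*
 * Paraphrased sketch, not the real kernel function: walk the private
 * huge_boot_pages list built at allocation time and hand each boot-time
 * gigantic page over to the hugetlb allocator. Only m->list and
 * m->hstate are read, so nothing beyond the small header struct has to
 * be initialized before this point.
 */
static void __init gather_bootmem_prealloc_sketch(void)
{
	struct huge_bootmem_page *m;

	list_for_each_entry(m, &huge_boot_pages, list) {
		struct page *page = virt_to_page(m);	/* !CONFIG_HIGHMEM case */
		struct hstate *h = m->hstate;

		/* Turn the raw range into a proper compound hugetlb page. */
		prep_compound_huge_page(page, h->order);
		prep_new_huge_page(h, page, page_to_nid(page));
	}
}
```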
On Tue 10-07-18 11:49:03, Cannon Matthews wrote:
> When using 1GiB pages during early boot, use the new
> memblock_virt_alloc_try_nid_raw() function to allocate memory without
> zeroing it. Zeroing out hundreds or thousands of GiB in a single core
> memset() call is very slow, and can make early boot last upwards of
> 20-30 minutes on multi TiB machines.
>
> To be safe, still zero the first sizeof(struct huge_bootmem_page) bytes
> since this is used as a temporary storage place for this info until
> gather_bootmem_prealloc() processes them later.
>
> The rest of the memory does not need to be zero'd as the hugetlb pages
> are always zero'd on page fault.
>
> Tested: Booted with ~3800 1G pages, and it booted successfully in
> roughly the same amount of time as with 0, as opposed to the 25+
> minutes it would take before.

The patch makes perfect sense to me. I wasn't even aware that it was
zeroing the memblock allocation. Thanks for spotting this and fixing it.

> Signed-off-by: Cannon Matthews <cannonmatthews@google.com>

I just do not think we need to zero the huge_bootmem_page portion of it.
It should be sufficient to INIT_LIST_HEAD before list_add. We do
initialize the rest explicitly already.

> ---
>  mm/hugetlb.c | 7 ++++++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 3612fbb32e9d..c93a2c77e881 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -2101,7 +2101,7 @@ int __alloc_bootmem_huge_page(struct hstate *h)
>  	for_each_node_mask_to_alloc(h, nr_nodes, node, &node_states[N_MEMORY]) {
>  		void *addr;
>
> -		addr = memblock_virt_alloc_try_nid_nopanic(
> +		addr = memblock_virt_alloc_try_nid_raw(
>  				huge_page_size(h), huge_page_size(h),
>  				0, BOOTMEM_ALLOC_ACCESSIBLE, node);
>  		if (addr) {
> @@ -2109,7 +2109,12 @@ int __alloc_bootmem_huge_page(struct hstate *h)
>  			 * Use the beginning of the huge page to store the
>  			 * huge_bootmem_page struct (until gather_bootmem
>  			 * puts them into the mem_map).
> +			 *
> +			 * memblock_virt_alloc_try_nid_raw returns non-zero'd
> +			 * memory so zero out just enough for this struct, the
> +			 * rest will be zero'd on page fault.
>  			 */
> +			memset(addr, 0, sizeof(struct huge_bootmem_page));
>  			m = addr;
>  			goto found;
>  		}
> --
> 2.18.0.203.gfac676dfb9-goog
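A minimal sketch of the alternative Michal is suggesting, assuming (as the existing code implies) that the found: label in __alloc_bootmem_huge_page() is where list and hstate are set up; this is illustrative, not the actual follow-up patch:

```c
found:
	/*
	 * Sketch only: with the non-zeroing raw allocation, drop the
	 * defensive memset() and instead explicitly initialize every
	 * field that is read before gather_bootmem_prealloc() runs.
	 */
	INIT_LIST_HEAD(&m->list);
	list_add(&m->list, &huge_boot_pages);	/* private boot-time list */
	m->hstate = h;
```

Worth noting that list_add() overwrites the new node's next/prev pointers anyway, so the INIT_LIST_HEAD() mainly documents the "no uninitialized fields" intent rather than being a functional requirement.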
On Wed 11-07-18 14:47:11, Michal Hocko wrote:
> On Tue 10-07-18 11:49:03, Cannon Matthews wrote:
> > When using 1GiB pages during early boot, use the new
> > memblock_virt_alloc_try_nid_raw() function to allocate memory without
> > zeroing it. Zeroing out hundreds or thousands of GiB in a single core
> > memset() call is very slow, and can make early boot last upwards of
> > 20-30 minutes on multi TiB machines.
> >
> > To be safe, still zero the first sizeof(struct huge_bootmem_page) bytes
> > since this is used as a temporary storage place for this info until
> > gather_bootmem_prealloc() processes them later.
> >
> > The rest of the memory does not need to be zero'd as the hugetlb pages
> > are always zero'd on page fault.
> >
> > Tested: Booted with ~3800 1G pages, and it booted successfully in
> > roughly the same amount of time as with 0, as opposed to the 25+
> > minutes it would take before.
>
> The patch makes perfect sense to me. I wasn't even aware that it was
> zeroing the memblock allocation. Thanks for spotting this and fixing it.
>
> > Signed-off-by: Cannon Matthews <cannonmatthews@google.com>
>
> I just do not think we need to zero the huge_bootmem_page portion of it.
> It should be sufficient to INIT_LIST_HEAD before list_add. We do
> initialize the rest explicitly already.

Forgot to mention that after that is addressed you can add
Acked-by: Michal Hocko <mhocko@suse.com>
On Tue 10-07-18 13:46:57, Mike Kravetz wrote:
> On 07/10/2018 11:49 AM, Cannon Matthews wrote:
> > When using 1GiB pages during early boot, use the new
> > memblock_virt_alloc_try_nid_raw() function to allocate memory without
> > zeroing it. Zeroing out hundreds or thousands of GiB in a single core
> > memset() call is very slow, and can make early boot last upwards of
> > 20-30 minutes on multi TiB machines.
> >
> > To be safe, still zero the first sizeof(struct huge_bootmem_page) bytes
> > since this is used as a temporary storage place for this info until
> > gather_bootmem_prealloc() processes them later.
> >
> > The rest of the memory does not need to be zero'd as the hugetlb pages
> > are always zero'd on page fault.
> >
> > Tested: Booted with ~3800 1G pages, and it booted successfully in
> > roughly the same amount of time as with 0, as opposed to the 25+
> > minutes it would take before.
>
> Nice improvement!
>
> > Signed-off-by: Cannon Matthews <cannonmatthews@google.com>
> > ---
> >  mm/hugetlb.c | 7 ++++++-
> >  1 file changed, 6 insertions(+), 1 deletion(-)
> >
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index 3612fbb32e9d..c93a2c77e881 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -2101,7 +2101,7 @@ int __alloc_bootmem_huge_page(struct hstate *h)
> >  	for_each_node_mask_to_alloc(h, nr_nodes, node, &node_states[N_MEMORY]) {
> >  		void *addr;
> >
> > -		addr = memblock_virt_alloc_try_nid_nopanic(
> > +		addr = memblock_virt_alloc_try_nid_raw(
> >  				huge_page_size(h), huge_page_size(h),
> >  				0, BOOTMEM_ALLOC_ACCESSIBLE, node);
> >  		if (addr) {
> > @@ -2109,7 +2109,12 @@ int __alloc_bootmem_huge_page(struct hstate *h)
> >  			 * Use the beginning of the huge page to store the
> >  			 * huge_bootmem_page struct (until gather_bootmem
> >  			 * puts them into the mem_map).
> > +			 *
> > +			 * memblock_virt_alloc_try_nid_raw returns non-zero'd
> > +			 * memory so zero out just enough for this struct, the
> > +			 * rest will be zero'd on page fault.
> >  			 */
> > +			memset(addr, 0, sizeof(struct huge_bootmem_page));
>
> This forced me to look at the usage of huge_bootmem_page. It is defined as:
>
> struct huge_bootmem_page {
> 	struct list_head list;
> 	struct hstate *hstate;
> #ifdef CONFIG_HIGHMEM
> 	phys_addr_t phys;
> #endif
> };
>
> The list and hstate fields are set immediately after allocating the memory
> block here and elsewhere. However, I can't find any code that sets phys,
> although it is potentially used in gather_bootmem_prealloc(). It appears
> powerpc used this field at one time, but no longer does.
>
> Am I missing something?

If yes, then I am missing it as well. phys is a cool name to grep for...

Anyway, does it really make any sense to allow gigantic pages on HIGHMEM
systems in the first place?
On 07/11/2018 05:48 AM, Michal Hocko wrote:
> On Wed 11-07-18 14:47:11, Michal Hocko wrote:
>> On Tue 10-07-18 11:49:03, Cannon Matthews wrote:
>>> When using 1GiB pages during early boot, use the new
>>> memblock_virt_alloc_try_nid_raw() function to allocate memory without
>>> zeroing it. Zeroing out hundreds or thousands of GiB in a single core
>>> memset() call is very slow, and can make early boot last upwards of
>>> 20-30 minutes on multi TiB machines.
>>>
>>> To be safe, still zero the first sizeof(struct huge_bootmem_page) bytes
>>> since this is used as a temporary storage place for this info until
>>> gather_bootmem_prealloc() processes them later.
>>>
>>> The rest of the memory does not need to be zero'd as the hugetlb pages
>>> are always zero'd on page fault.
>>>
>>> Tested: Booted with ~3800 1G pages, and it booted successfully in
>>> roughly the same amount of time as with 0, as opposed to the 25+
>>> minutes it would take before.
>>
>> The patch makes perfect sense to me. I wasn't even aware that it was
>> zeroing the memblock allocation. Thanks for spotting this and fixing it.
>>
>>> Signed-off-by: Cannon Matthews <cannonmatthews@google.com>
>>
>> I just do not think we need to zero the huge_bootmem_page portion of it.
>> It should be sufficient to INIT_LIST_HEAD before list_add. We do
>> initialize the rest explicitly already.
>
> Forgot to mention that after that is addressed you can add
> Acked-by: Michal Hocko <mhocko@suse.com>

Cannon,

How about if you make this change suggested by Michal, and I will submit a
separate patch to revert the patch which added the phys field to the
huge_bootmem_page structure.

FWIW,
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
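For illustration only (the separate revert Mike mentions is not part of this thread), removing the unused phys field and its CONFIG_HIGHMEM guard would presumably leave the structure looking roughly like this:

```c
/*
 * Hypothetical post-cleanup definition: only the two fields that
 * __alloc_bootmem_huge_page() actually initializes remain.
 */
struct huge_bootmem_page {
	struct list_head list;	/* linked into the huge_boot_pages list */
	struct hstate *hstate;	/* hugepage size this boot-time page backs */
};
```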
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 3612fbb32e9d..c93a2c77e881 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2101,7 +2101,7 @@ int __alloc_bootmem_huge_page(struct hstate *h)
 	for_each_node_mask_to_alloc(h, nr_nodes, node, &node_states[N_MEMORY]) {
 		void *addr;
 
-		addr = memblock_virt_alloc_try_nid_nopanic(
+		addr = memblock_virt_alloc_try_nid_raw(
 				huge_page_size(h), huge_page_size(h),
 				0, BOOTMEM_ALLOC_ACCESSIBLE, node);
 		if (addr) {
@@ -2109,7 +2109,12 @@ int __alloc_bootmem_huge_page(struct hstate *h)
 			 * Use the beginning of the huge page to store the
 			 * huge_bootmem_page struct (until gather_bootmem
 			 * puts them into the mem_map).
+			 *
+			 * memblock_virt_alloc_try_nid_raw returns non-zero'd
+			 * memory so zero out just enough for this struct, the
+			 * rest will be zero'd on page fault.
 			 */
+			memset(addr, 0, sizeof(struct huge_bootmem_page));
 			m = addr;
 			goto found;
 		}
When using 1GiB pages during early boot, use the new
memblock_virt_alloc_try_nid_raw() function to allocate memory without
zeroing it. Zeroing out hundreds or thousands of GiB in a single core
memset() call is very slow, and can make early boot last upwards of
20-30 minutes on multi TiB machines.

To be safe, still zero the first sizeof(struct huge_bootmem_page) bytes
since this is used as a temporary storage place for this info until
gather_bootmem_prealloc() processes them later.

The rest of the memory does not need to be zero'd as the hugetlb pages
are always zero'd on page fault.

Tested: Booted with ~3800 1G pages, and it booted successfully in
roughly the same amount of time as with 0, as opposed to the 25+
minutes it would take before.

Signed-off-by: Cannon Matthews <cannonmatthews@google.com>
---
 mm/hugetlb.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

--
2.18.0.203.gfac676dfb9-goog
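As a usage note, boot-time gigantic pages of the kind exercised in the Tested: line are normally reserved from the kernel command line; the values below are illustrative (the ~3800 count comes from the test description, the rest is a typical x86_64 setup rather than anything stated in the thread, and 1GiB pages additionally require CPU support):

```text
default_hugepagesz=1G hugepagesz=1G hugepages=3800
```

Pages reserved this way are allocated during early boot via __alloc_bootmem_huge_page(), which is exactly the path this patch stops from zeroing the full 1GiB per page.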