Message ID | 20230218002819.1486479-10-jthoughton@google.com (mailing list archive) |
---|---|
State | New |
Series | hugetlb: introduce HugeTLB high-granularity mapping |
On Fri, Feb 17, 2023 at 4:28 PM James Houghton <jthoughton@google.com> wrote:
>
> Issuing ioctl(MADV_SPLIT) on a HugeTLB address range will enable
> HugeTLB HGM. MADV_SPLIT was chosen for the name so that this API can be
> applied to non-HugeTLB memory in the future, if such an application is
> to arise.
>
> MADV_SPLIT provides several API changes for some syscalls on HugeTLB
> address ranges:
> 1. UFFDIO_CONTINUE is allowed for MAP_SHARED VMAs at PAGE_SIZE
>    alignment.
> 2. read()ing a page fault event from a userfaultfd will yield a
>    PAGE_SIZE-rounded address, instead of a huge-page-size-rounded
>    address (unless UFFD_FEATURE_EXACT_ADDRESS is used).
>
> There is no way to disable the API changes that come with issuing
> MADV_SPLIT. MADV_COLLAPSE can be used to collapse high-granularity page
> table mappings that come from the extended functionality that comes with
> using MADV_SPLIT.
>

So is a hugetlb page or VMA that has been MADV_SPLIT + MADV_COLLAPSE
distinct from a hugetlb page or vma that has not been? I thought
COLLAPSE would reverse the effects on SPLIT completely.

> For post-copy live migration, the expected use-case is:
> 1. mmap(MAP_SHARED, some_fd) primary mapping
> 2. mmap(MAP_SHARED, some_fd) alias mapping
> 3. MADV_SPLIT the primary mapping
> 4. UFFDIO_REGISTER/etc. the primary mapping
> 5. Copy memory contents into alias mapping and UFFDIO_CONTINUE the
>    corresponding PAGE_SIZE sections in the primary mapping.
>

Huh, so MADV_SPLIT doesn't actually split an existing PMD mapping into
high granularity mappings. Instead it says that future mappings may be
high granularity? I assume they may not even be high granularity, like
if the alias mapping faulted in a full hugetlb page (without
UFFDIO_CONTINUE) that page would be regular mapped not high
granularity mapped.

This may be bikeshedding but I do think a clearer name is warranted.
Maybe MADV_MAY_SPLIT or something.

> More API changes may be added in the future.
> > Signed-off-by: James Houghton <jthoughton@google.com> > > diff --git a/arch/alpha/include/uapi/asm/mman.h b/arch/alpha/include/uapi/asm/mman.h > index 763929e814e9..7a26f3648b90 100644 > --- a/arch/alpha/include/uapi/asm/mman.h > +++ b/arch/alpha/include/uapi/asm/mman.h > @@ -78,6 +78,8 @@ > > #define MADV_COLLAPSE 25 /* Synchronous hugepage collapse */ > > +#define MADV_SPLIT 26 /* Enable hugepage high-granularity APIs */ > + > /* compatibility flags */ > #define MAP_FILE 0 > > diff --git a/arch/mips/include/uapi/asm/mman.h b/arch/mips/include/uapi/asm/mman.h > index c6e1fc77c996..f8a74a3a0928 100644 > --- a/arch/mips/include/uapi/asm/mman.h > +++ b/arch/mips/include/uapi/asm/mman.h > @@ -105,6 +105,8 @@ > > #define MADV_COLLAPSE 25 /* Synchronous hugepage collapse */ > > +#define MADV_SPLIT 26 /* Enable hugepage high-granularity APIs */ > + > /* compatibility flags */ > #define MAP_FILE 0 > > diff --git a/arch/parisc/include/uapi/asm/mman.h b/arch/parisc/include/uapi/asm/mman.h > index 68c44f99bc93..a6dc6a56c941 100644 > --- a/arch/parisc/include/uapi/asm/mman.h > +++ b/arch/parisc/include/uapi/asm/mman.h > @@ -72,6 +72,8 @@ > > #define MADV_COLLAPSE 25 /* Synchronous hugepage collapse */ > > +#define MADV_SPLIT 74 /* Enable hugepage high-granularity APIs */ > + > #define MADV_HWPOISON 100 /* poison a page for testing */ > #define MADV_SOFT_OFFLINE 101 /* soft offline page for testing */ > > diff --git a/arch/xtensa/include/uapi/asm/mman.h b/arch/xtensa/include/uapi/asm/mman.h > index 1ff0c858544f..f98a77c430a9 100644 > --- a/arch/xtensa/include/uapi/asm/mman.h > +++ b/arch/xtensa/include/uapi/asm/mman.h > @@ -113,6 +113,8 @@ > > #define MADV_COLLAPSE 25 /* Synchronous hugepage collapse */ > > +#define MADV_SPLIT 26 /* Enable hugepage high-granularity APIs */ > + > /* compatibility flags */ > #define MAP_FILE 0 > > diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h > index 6ce1f1ceb432..996e8ded092f 100644 > --- 
a/include/uapi/asm-generic/mman-common.h > +++ b/include/uapi/asm-generic/mman-common.h > @@ -79,6 +79,8 @@ > > #define MADV_COLLAPSE 25 /* Synchronous hugepage collapse */ > > +#define MADV_SPLIT 26 /* Enable hugepage high-granularity APIs */ > + > /* compatibility flags */ > #define MAP_FILE 0 > > diff --git a/mm/madvise.c b/mm/madvise.c > index c2202f51e9dd..8c004c678262 100644 > --- a/mm/madvise.c > +++ b/mm/madvise.c > @@ -1006,6 +1006,28 @@ static long madvise_remove(struct vm_area_struct *vma, > return error; > } > > +static int madvise_split(struct vm_area_struct *vma, > + unsigned long *new_flags) > +{ > +#ifdef CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING > + if (!is_vm_hugetlb_page(vma) || !hugetlb_hgm_eligible(vma)) > + return -EINVAL; > + > + /* > + * PMD sharing doesn't work with HGM. If this MADV_SPLIT is on part > + * of a VMA, then we will split the VMA. Here, we're unsharing before > + * splitting because it's simpler, although we may be unsharing more > + * than we need. > + */ > + hugetlb_unshare_all_pmds(vma); > + > + *new_flags |= VM_HUGETLB_HGM; > + return 0; > +#else > + return -EINVAL; > +#endif > +} > + > /* > * Apply an madvise behavior to a region of a vma. 
madvise_update_vma > * will handle splitting a vm area into separate areas, each area with its own > @@ -1084,6 +1106,11 @@ static int madvise_vma_behavior(struct vm_area_struct *vma, > break; > case MADV_COLLAPSE: > return madvise_collapse(vma, prev, start, end); > + case MADV_SPLIT: > + error = madvise_split(vma, &new_flags); > + if (error) > + goto out; > + break; > } > > anon_name = anon_vma_name(vma); > @@ -1178,6 +1205,9 @@ madvise_behavior_valid(int behavior) > case MADV_HUGEPAGE: > case MADV_NOHUGEPAGE: > case MADV_COLLAPSE: > +#endif > +#ifdef CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING > + case MADV_SPLIT: > #endif > case MADV_DONTDUMP: > case MADV_DODUMP: > @@ -1368,6 +1398,8 @@ int madvise_set_anon_name(struct mm_struct *mm, unsigned long start, > * transparent huge pages so the existing pages will not be > * coalesced into THP and new pages will not be allocated as THP. > * MADV_COLLAPSE - synchronously coalesce pages into new THP. > + * MADV_SPLIT - allow HugeTLB pages to be mapped at PAGE_SIZE. This allows > + * UFFDIO_CONTINUE to accept PAGE_SIZE-aligned regions. > * MADV_DONTDUMP - the application wants to prevent pages in the given range > * from being included in its core dump. > * MADV_DODUMP - cancel MADV_DONTDUMP: no longer exclude from core dump. > -- > 2.39.2.637.g21b0678d19-goog >
On Fri, Feb 17, 2023 at 5:58 PM Mina Almasry <almasrymina@google.com> wrote:
>
> On Fri, Feb 17, 2023 at 4:28 PM James Houghton <jthoughton@google.com> wrote:
> >
> > Issuing ioctl(MADV_SPLIT) on a HugeTLB address range will enable
> > HugeTLB HGM. MADV_SPLIT was chosen for the name so that this API can be
> > applied to non-HugeTLB memory in the future, if such an application is
> > to arise.
> >
> > MADV_SPLIT provides several API changes for some syscalls on HugeTLB
> > address ranges:
> > 1. UFFDIO_CONTINUE is allowed for MAP_SHARED VMAs at PAGE_SIZE
> >    alignment.
> > 2. read()ing a page fault event from a userfaultfd will yield a
> >    PAGE_SIZE-rounded address, instead of a huge-page-size-rounded
> >    address (unless UFFD_FEATURE_EXACT_ADDRESS is used).
> >
> > There is no way to disable the API changes that come with issuing
> > MADV_SPLIT. MADV_COLLAPSE can be used to collapse high-granularity page
> > table mappings that come from the extended functionality that comes with
> > using MADV_SPLIT.
> >
>
> So is a hugetlb page or VMA that has been MADV_SPLIT + MADV_COLLAPSE
> distinct from a hugetlb page or vma that has not been? I thought
> COLLAPSE would reverse the effects on SPLIT completely.

Right now, MADV_COLLAPSE does *not* completely undo the effects of an
MADV_SPLIT. The API changes that come from MADV_SPLIT aren't undone
with an MADV_COLLAPSE.

>
> > For post-copy live migration, the expected use-case is:
> > 1. mmap(MAP_SHARED, some_fd) primary mapping
> > 2. mmap(MAP_SHARED, some_fd) alias mapping
> > 3. MADV_SPLIT the primary mapping
> > 4. UFFDIO_REGISTER/etc. the primary mapping
> > 5. Copy memory contents into alias mapping and UFFDIO_CONTINUE the
> >    corresponding PAGE_SIZE sections in the primary mapping.
> >
>
> Huh, so MADV_SPLIT doesn't actually split an existing PMD mapping into
> high granularity mappings. Instead it says that future mappings may be
> high granularity? I assume they may not even be high granularity, like
> if the alias mapping faulted in a full hugetlb page (without
> UFFDIO_CONTINUE) that page would be regular mapped not high
> granularity mapped.

MADV_SPLIT just means "userspace is aware that they are able to start
mapping HugeTLB pages at high-granularity". Right now the only way to
get high-granularity mappings is with UFFDIO_CONTINUE, but there may be
other ways in the future.

As of this series, if you MADV_SPLIT a HugeTLB VMA and you aren't using
userfaultfd minor faults, it's basically a no-op. The mappings that are
created will still be huge. I could change this, but I don't really see
a reason to right now.

>
> This may be bikeshedding but I do think a clearer name is warranted.
> Maybe MADV_MAY_SPLIT or something.

I agree -- MADV_MAY_SPLIT more accurately describes the HugeTLB
functionality. I really don't mind what the MADV is called. I think
enabling the high-granularity userfaultfd bits with a userfaultfd
feature[1] worked reasonably well. There is some API discussion in that
thread[1].

[1]: https://lore.kernel.org/linux-mm/20221021163703.3218176-34-jthoughton@google.com/
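To make the "only way to get high-granularity mappings is with UFFDIO_CONTINUE" point concrete, here is a minimal sketch of resolving a minor fault for one base page. It assumes a kernel carrying this series, a userfaultfd that is already registered for minor faults on a MADV_SPLIT'd HugeTLB range, and UAPI headers recent enough to define UFFDIO_CONTINUE. The helper name `continue_small_page` is hypothetical and not part of the series; without HGM, a range smaller than the huge page size would be expected to fail.

```c
#include <linux/userfaultfd.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from the series): issue UFFDIO_CONTINUE for the
 * single base page that contains fault_addr.
 * Returns 0 on success, -1 on failure (errno is set by ioctl()).
 */
static int continue_small_page(int uffd, uint64_t fault_addr)
{
	long page_size = sysconf(_SC_PAGESIZE);
	struct uffdio_continue cont;

	if (page_size <= 0)
		return -1;

	/* Round down to PAGE_SIZE rather than to the huge page size. */
	cont.range.start = fault_addr & ~((uint64_t)page_size - 1);
	cont.range.len = (uint64_t)page_size;
	cont.mode = 0;
	cont.mapped = 0;

	return ioctl(uffd, UFFDIO_CONTINUE, &cont) == 0 ? 0 : -1;
}
```

A fault handler would call this with the address read from the userfaultfd event, after the corresponding contents have been written through the alias mapping.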
On 02/18/23 00:27, James Houghton wrote: > Issuing ioctl(MADV_SPLIT) on a HugeTLB address range will enable > HugeTLB HGM. MADV_SPLIT was chosen for the name so that this API can be > applied to non-HugeTLB memory in the future, if such an application is > to arise. > > MADV_SPLIT provides several API changes for some syscalls on HugeTLB > address ranges: > 1. UFFDIO_CONTINUE is allowed for MAP_SHARED VMAs at PAGE_SIZE > alignment. > 2. read()ing a page fault event from a userfaultfd will yield a > PAGE_SIZE-rounded address, instead of a huge-page-size-rounded > address (unless UFFD_FEATURE_EXACT_ADDRESS is used). > > There is no way to disable the API changes that come with issuing > MADV_SPLIT. MADV_COLLAPSE can be used to collapse high-granularity page > table mappings that come from the extended functionality that comes with > using MADV_SPLIT. > > For post-copy live migration, the expected use-case is: > 1. mmap(MAP_SHARED, some_fd) primary mapping > 2. mmap(MAP_SHARED, some_fd) alias mapping > 3. MADV_SPLIT the primary mapping > 4. UFFDIO_REGISTER/etc. the primary mapping > 5. Copy memory contents into alias mapping and UFFDIO_CONTINUE the > corresponding PAGE_SIZE sections in the primary mapping. > > More API changes may be added in the future. 
> > Signed-off-by: James Houghton <jthoughton@google.com> > > diff --git a/arch/alpha/include/uapi/asm/mman.h b/arch/alpha/include/uapi/asm/mman.h > index 763929e814e9..7a26f3648b90 100644 > --- a/arch/alpha/include/uapi/asm/mman.h > +++ b/arch/alpha/include/uapi/asm/mman.h > @@ -78,6 +78,8 @@ > > #define MADV_COLLAPSE 25 /* Synchronous hugepage collapse */ > > +#define MADV_SPLIT 26 /* Enable hugepage high-granularity APIs */ > + > /* compatibility flags */ > #define MAP_FILE 0 > > diff --git a/arch/mips/include/uapi/asm/mman.h b/arch/mips/include/uapi/asm/mman.h > index c6e1fc77c996..f8a74a3a0928 100644 > --- a/arch/mips/include/uapi/asm/mman.h > +++ b/arch/mips/include/uapi/asm/mman.h > @@ -105,6 +105,8 @@ > > #define MADV_COLLAPSE 25 /* Synchronous hugepage collapse */ > > +#define MADV_SPLIT 26 /* Enable hugepage high-granularity APIs */ > + > /* compatibility flags */ > #define MAP_FILE 0 > > diff --git a/arch/parisc/include/uapi/asm/mman.h b/arch/parisc/include/uapi/asm/mman.h > index 68c44f99bc93..a6dc6a56c941 100644 > --- a/arch/parisc/include/uapi/asm/mman.h > +++ b/arch/parisc/include/uapi/asm/mman.h > @@ -72,6 +72,8 @@ > > #define MADV_COLLAPSE 25 /* Synchronous hugepage collapse */ > > +#define MADV_SPLIT 74 /* Enable hugepage high-granularity APIs */ > + > #define MADV_HWPOISON 100 /* poison a page for testing */ > #define MADV_SOFT_OFFLINE 101 /* soft offline page for testing */ > > diff --git a/arch/xtensa/include/uapi/asm/mman.h b/arch/xtensa/include/uapi/asm/mman.h > index 1ff0c858544f..f98a77c430a9 100644 > --- a/arch/xtensa/include/uapi/asm/mman.h > +++ b/arch/xtensa/include/uapi/asm/mman.h > @@ -113,6 +113,8 @@ > > #define MADV_COLLAPSE 25 /* Synchronous hugepage collapse */ > > +#define MADV_SPLIT 26 /* Enable hugepage high-granularity APIs */ > + > /* compatibility flags */ > #define MAP_FILE 0 > > diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h > index 6ce1f1ceb432..996e8ded092f 100644 > --- 
a/include/uapi/asm-generic/mman-common.h
> +++ b/include/uapi/asm-generic/mman-common.h
> @@ -79,6 +79,8 @@
>
>  #define MADV_COLLAPSE	25		/* Synchronous hugepage collapse */
>
> +#define MADV_SPLIT	26		/* Enable hugepage high-granularity APIs */
> +
>  /* compatibility flags */
>  #define MAP_FILE	0
>
> diff --git a/mm/madvise.c b/mm/madvise.c
> index c2202f51e9dd..8c004c678262 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -1006,6 +1006,28 @@ static long madvise_remove(struct vm_area_struct *vma,
>  	return error;
>  }
>
> +static int madvise_split(struct vm_area_struct *vma,
> +			 unsigned long *new_flags)
> +{
> +#ifdef CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING
> +	if (!is_vm_hugetlb_page(vma) || !hugetlb_hgm_eligible(vma))
> +		return -EINVAL;
> +
> +	/*
> +	 * PMD sharing doesn't work with HGM. If this MADV_SPLIT is on part
> +	 * of a VMA, then we will split the VMA. Here, we're unsharing before
> +	 * splitting because it's simpler, although we may be unsharing more
> +	 * than we need.
> +	 */
> +	hugetlb_unshare_all_pmds(vma);

I think we should just unshare the (appropriately aligned) range within the
vma that is the target of MADV_SPLIT. No need to unshare the entire vma.

> +
> +	*new_flags |= VM_HUGETLB_HGM;
> +	return 0;
> +#else
> +	return -EINVAL;
> +#endif
> +}
> +
>  /*
>   * Apply an madvise behavior to a region of a vma. madvise_update_vma
>   * will handle splitting a vm area into separate areas, each area with its own
> @@ -1084,6 +1106,11 @@ static int madvise_vma_behavior(struct vm_area_struct *vma,
>  		break;
>  	case MADV_COLLAPSE:
>  		return madvise_collapse(vma, prev, start, end);
> +	case MADV_SPLIT:
> +		error = madvise_split(vma, &new_flags);
> +		if (error)
> +			goto out;

Not a huge deal, but if one passes an invalid range (such as not huge page
size aligned) to MADV_SPLIT, then we will not notice the error until
later in madvise_update_vma() when the vma split fails. By then, we will
have unshared all pmds in the entire vma (or just the range if you agree
with my suggestion above).
On Fri, Feb 24, 2023 at 3:25 PM Mike Kravetz <mike.kravetz@oracle.com> wrote: > > On 02/18/23 00:27, James Houghton wrote: > > Issuing ioctl(MADV_SPLIT) on a HugeTLB address range will enable > > HugeTLB HGM. MADV_SPLIT was chosen for the name so that this API can be > > applied to non-HugeTLB memory in the future, if such an application is > > to arise. > > > > MADV_SPLIT provides several API changes for some syscalls on HugeTLB > > address ranges: > > 1. UFFDIO_CONTINUE is allowed for MAP_SHARED VMAs at PAGE_SIZE > > alignment. > > 2. read()ing a page fault event from a userfaultfd will yield a > > PAGE_SIZE-rounded address, instead of a huge-page-size-rounded > > address (unless UFFD_FEATURE_EXACT_ADDRESS is used). > > > > There is no way to disable the API changes that come with issuing > > MADV_SPLIT. MADV_COLLAPSE can be used to collapse high-granularity page > > table mappings that come from the extended functionality that comes with > > using MADV_SPLIT. > > > > For post-copy live migration, the expected use-case is: > > 1. mmap(MAP_SHARED, some_fd) primary mapping > > 2. mmap(MAP_SHARED, some_fd) alias mapping > > 3. MADV_SPLIT the primary mapping > > 4. UFFDIO_REGISTER/etc. the primary mapping > > 5. Copy memory contents into alias mapping and UFFDIO_CONTINUE the > > corresponding PAGE_SIZE sections in the primary mapping. > > > > More API changes may be added in the future. 
> > > > Signed-off-by: James Houghton <jthoughton@google.com> > > > > diff --git a/arch/alpha/include/uapi/asm/mman.h b/arch/alpha/include/uapi/asm/mman.h > > index 763929e814e9..7a26f3648b90 100644 > > --- a/arch/alpha/include/uapi/asm/mman.h > > +++ b/arch/alpha/include/uapi/asm/mman.h > > @@ -78,6 +78,8 @@ > > > > #define MADV_COLLAPSE 25 /* Synchronous hugepage collapse */ > > > > +#define MADV_SPLIT 26 /* Enable hugepage high-granularity APIs */ > > + > > /* compatibility flags */ > > #define MAP_FILE 0 > > > > diff --git a/arch/mips/include/uapi/asm/mman.h b/arch/mips/include/uapi/asm/mman.h > > index c6e1fc77c996..f8a74a3a0928 100644 > > --- a/arch/mips/include/uapi/asm/mman.h > > +++ b/arch/mips/include/uapi/asm/mman.h > > @@ -105,6 +105,8 @@ > > > > #define MADV_COLLAPSE 25 /* Synchronous hugepage collapse */ > > > > +#define MADV_SPLIT 26 /* Enable hugepage high-granularity APIs */ > > + > > /* compatibility flags */ > > #define MAP_FILE 0 > > > > diff --git a/arch/parisc/include/uapi/asm/mman.h b/arch/parisc/include/uapi/asm/mman.h > > index 68c44f99bc93..a6dc6a56c941 100644 > > --- a/arch/parisc/include/uapi/asm/mman.h > > +++ b/arch/parisc/include/uapi/asm/mman.h > > @@ -72,6 +72,8 @@ > > > > #define MADV_COLLAPSE 25 /* Synchronous hugepage collapse */ > > > > +#define MADV_SPLIT 74 /* Enable hugepage high-granularity APIs */ > > + > > #define MADV_HWPOISON 100 /* poison a page for testing */ > > #define MADV_SOFT_OFFLINE 101 /* soft offline page for testing */ > > > > diff --git a/arch/xtensa/include/uapi/asm/mman.h b/arch/xtensa/include/uapi/asm/mman.h > > index 1ff0c858544f..f98a77c430a9 100644 > > --- a/arch/xtensa/include/uapi/asm/mman.h > > +++ b/arch/xtensa/include/uapi/asm/mman.h > > @@ -113,6 +113,8 @@ > > > > #define MADV_COLLAPSE 25 /* Synchronous hugepage collapse */ > > > > +#define MADV_SPLIT 26 /* Enable hugepage high-granularity APIs */ > > + > > /* compatibility flags */ > > #define MAP_FILE 0 > > > > diff --git 
a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
> > index 6ce1f1ceb432..996e8ded092f 100644
> > --- a/include/uapi/asm-generic/mman-common.h
> > +++ b/include/uapi/asm-generic/mman-common.h
> > @@ -79,6 +79,8 @@
> >
> >  #define MADV_COLLAPSE	25		/* Synchronous hugepage collapse */
> >
> > +#define MADV_SPLIT	26		/* Enable hugepage high-granularity APIs */
> > +
> >  /* compatibility flags */
> >  #define MAP_FILE	0
> >
> > diff --git a/mm/madvise.c b/mm/madvise.c
> > index c2202f51e9dd..8c004c678262 100644
> > --- a/mm/madvise.c
> > +++ b/mm/madvise.c
> > @@ -1006,6 +1006,28 @@ static long madvise_remove(struct vm_area_struct *vma,
> >  	return error;
> >  }
> >
> > +static int madvise_split(struct vm_area_struct *vma,
> > +			 unsigned long *new_flags)
> > +{
> > +#ifdef CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING
> > +	if (!is_vm_hugetlb_page(vma) || !hugetlb_hgm_eligible(vma))
> > +		return -EINVAL;
> > +
> > +	/*
> > +	 * PMD sharing doesn't work with HGM. If this MADV_SPLIT is on part
> > +	 * of a VMA, then we will split the VMA. Here, we're unsharing before
> > +	 * splitting because it's simpler, although we may be unsharing more
> > +	 * than we need.
> > +	 */
> > +	hugetlb_unshare_all_pmds(vma);
>
> I think we should just unshare the (appropriately aligned) range within the
> vma that is the target of MADV_SPLIT. No need to unshare the entire vma.

Right I can do that, and I can check for appropriate alignment here
(else fail with -EINVAL).

>
> > +
> > +	*new_flags |= VM_HUGETLB_HGM;
> > +	return 0;
> > +#else
> > +	return -EINVAL;
> > +#endif
> > +}
> > +
> >  /*
> >   * Apply an madvise behavior to a region of a vma. madvise_update_vma
> >   * will handle splitting a vm area into separate areas, each area with its own
> > @@ -1084,6 +1106,11 @@ static int madvise_vma_behavior(struct vm_area_struct *vma,
> >  		break;
> >  	case MADV_COLLAPSE:
> >  		return madvise_collapse(vma, prev, start, end);
> > +	case MADV_SPLIT:
> > +		error = madvise_split(vma, &new_flags);
> > +		if (error)
> > +			goto out;
>
> Not a huge deal, but if one passes an invalid range (such as not huge page
> size aligned) to MADV_SPLIT, then we will not notice the error until
> later in madvise_update_vma() when the vma split fails. By then, we will
> have unshared all pmds in the entire vma (or just the range if you agree
> with my suggestion above).

Good point. I'll fix this for v3. :)

Thanks Mike.
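The early alignment check agreed on above could look roughly like the following. This is a userspace-compilable sketch of the arithmetic only, not kernel code from the series: the helper name `range_is_hugepage_aligned` is hypothetical, and it assumes the huge page size is a power of two. In the kernel, the size would presumably come from the VMA's hstate, and a failed check would return -EINVAL before any PMDs are unshared.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper (not from the series): true when the non-empty range
 * [start, end) both begins and ends on a huge-page boundary.
 * huge_page_size must be a power of two.
 */
static bool range_is_hugepage_aligned(uintptr_t start, uintptr_t end,
				      uintptr_t huge_page_size)
{
	uintptr_t mask = huge_page_size - 1;

	return start < end && (start & mask) == 0 && (end & mask) == 0;
}
```

Running this check first means an invalid request is rejected with no side effects, which is the ordering problem Mike describes.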
diff --git a/arch/alpha/include/uapi/asm/mman.h b/arch/alpha/include/uapi/asm/mman.h
index 763929e814e9..7a26f3648b90 100644
--- a/arch/alpha/include/uapi/asm/mman.h
+++ b/arch/alpha/include/uapi/asm/mman.h
@@ -78,6 +78,8 @@

 #define MADV_COLLAPSE	25		/* Synchronous hugepage collapse */

+#define MADV_SPLIT	26		/* Enable hugepage high-granularity APIs */
+
 /* compatibility flags */
 #define MAP_FILE	0

diff --git a/arch/mips/include/uapi/asm/mman.h b/arch/mips/include/uapi/asm/mman.h
index c6e1fc77c996..f8a74a3a0928 100644
--- a/arch/mips/include/uapi/asm/mman.h
+++ b/arch/mips/include/uapi/asm/mman.h
@@ -105,6 +105,8 @@

 #define MADV_COLLAPSE	25		/* Synchronous hugepage collapse */

+#define MADV_SPLIT	26		/* Enable hugepage high-granularity APIs */
+
 /* compatibility flags */
 #define MAP_FILE	0

diff --git a/arch/parisc/include/uapi/asm/mman.h b/arch/parisc/include/uapi/asm/mman.h
index 68c44f99bc93..a6dc6a56c941 100644
--- a/arch/parisc/include/uapi/asm/mman.h
+++ b/arch/parisc/include/uapi/asm/mman.h
@@ -72,6 +72,8 @@

 #define MADV_COLLAPSE	25		/* Synchronous hugepage collapse */

+#define MADV_SPLIT	74		/* Enable hugepage high-granularity APIs */
+
 #define MADV_HWPOISON     100		/* poison a page for testing */
 #define MADV_SOFT_OFFLINE 101		/* soft offline page for testing */

diff --git a/arch/xtensa/include/uapi/asm/mman.h b/arch/xtensa/include/uapi/asm/mman.h
index 1ff0c858544f..f98a77c430a9 100644
--- a/arch/xtensa/include/uapi/asm/mman.h
+++ b/arch/xtensa/include/uapi/asm/mman.h
@@ -113,6 +113,8 @@

 #define MADV_COLLAPSE	25		/* Synchronous hugepage collapse */

+#define MADV_SPLIT	26		/* Enable hugepage high-granularity APIs */
+
 /* compatibility flags */
 #define MAP_FILE	0

diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
index 6ce1f1ceb432..996e8ded092f 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -79,6 +79,8 @@

 #define MADV_COLLAPSE	25		/* Synchronous hugepage collapse */

+#define MADV_SPLIT	26		/* Enable hugepage high-granularity APIs */
+
 /* compatibility flags */
 #define MAP_FILE	0

diff --git a/mm/madvise.c b/mm/madvise.c
index c2202f51e9dd..8c004c678262 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -1006,6 +1006,28 @@ static long madvise_remove(struct vm_area_struct *vma,
 	return error;
 }

+static int madvise_split(struct vm_area_struct *vma,
+			 unsigned long *new_flags)
+{
+#ifdef CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING
+	if (!is_vm_hugetlb_page(vma) || !hugetlb_hgm_eligible(vma))
+		return -EINVAL;
+
+	/*
+	 * PMD sharing doesn't work with HGM. If this MADV_SPLIT is on part
+	 * of a VMA, then we will split the VMA. Here, we're unsharing before
+	 * splitting because it's simpler, although we may be unsharing more
+	 * than we need.
+	 */
+	hugetlb_unshare_all_pmds(vma);
+
+	*new_flags |= VM_HUGETLB_HGM;
+	return 0;
+#else
+	return -EINVAL;
+#endif
+}
+
 /*
  * Apply an madvise behavior to a region of a vma. madvise_update_vma
  * will handle splitting a vm area into separate areas, each area with its own
@@ -1084,6 +1106,11 @@ static int madvise_vma_behavior(struct vm_area_struct *vma,
 		break;
 	case MADV_COLLAPSE:
 		return madvise_collapse(vma, prev, start, end);
+	case MADV_SPLIT:
+		error = madvise_split(vma, &new_flags);
+		if (error)
+			goto out;
+		break;
 	}

 	anon_name = anon_vma_name(vma);
@@ -1178,6 +1205,9 @@ madvise_behavior_valid(int behavior)
 	case MADV_HUGEPAGE:
 	case MADV_NOHUGEPAGE:
 	case MADV_COLLAPSE:
+#endif
+#ifdef CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING
+	case MADV_SPLIT:
 #endif
 	case MADV_DONTDUMP:
 	case MADV_DODUMP:
@@ -1368,6 +1398,8 @@ int madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
  *		transparent huge pages so the existing pages will not be
  *		coalesced into THP and new pages will not be allocated as THP.
  *  MADV_COLLAPSE - synchronously coalesce pages into new THP.
+ *  MADV_SPLIT - allow HugeTLB pages to be mapped at PAGE_SIZE. This allows
+ *		UFFDIO_CONTINUE to accept PAGE_SIZE-aligned regions.
  *  MADV_DONTDUMP - the application wants to prevent pages in the given range
  *		from being included in its core dump.
  *  MADV_DODUMP - cancel MADV_DONTDUMP: no longer exclude from core dump.
Issuing madvise(MADV_SPLIT) on a HugeTLB address range will enable
HugeTLB HGM. MADV_SPLIT was chosen for the name so that this API can be
applied to non-HugeTLB memory in the future, should such a use case
arise.

MADV_SPLIT provides several API changes for some syscalls on HugeTLB
address ranges:
1. UFFDIO_CONTINUE is allowed for MAP_SHARED VMAs at PAGE_SIZE
   alignment.
2. read()ing a page fault event from a userfaultfd will yield a
   PAGE_SIZE-rounded address, instead of a huge-page-size-rounded
   address (unless UFFD_FEATURE_EXACT_ADDRESS is used).

There is no way to disable the API changes that come with issuing
MADV_SPLIT. MADV_COLLAPSE can be used to collapse the high-granularity
page table mappings created through the extended functionality that
MADV_SPLIT enables.

For post-copy live migration, the expected use-case is:
1. mmap(MAP_SHARED, some_fd) primary mapping
2. mmap(MAP_SHARED, some_fd) alias mapping
3. MADV_SPLIT the primary mapping
4. UFFDIO_REGISTER/etc. the primary mapping
5. Copy memory contents into alias mapping and UFFDIO_CONTINUE the
   corresponding PAGE_SIZE sections in the primary mapping.

More API changes may be added in the future.

Signed-off-by: James Houghton <jthoughton@google.com>
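Steps 1 to 3 of the use-case above can be sketched in userspace as follows. This is an illustration under stated assumptions, not code from the series: MADV_SPLIT is not in mainline headers, so the value 26 proposed here for the generic ABI is defined locally (the patch proposes 74 on parisc); `hgm_map_pair` and `struct hgm_mappings` are hypothetical names; and `hugetlb_fd` is assumed to refer to a hugetlbfs file or a memfd created with MFD_HUGETLB. On a kernel without this series the madvise() call is expected to fail with EINVAL.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

/* Assumption: the value this patch proposes for most architectures. */
#ifndef MADV_SPLIT
#define MADV_SPLIT 26
#endif

/* Hypothetical container for the two views of the same HugeTLB file. */
struct hgm_mappings {
	void *primary;	/* to be registered with userfaultfd */
	void *alias;	/* written directly by the migration code */
};

/* Returns 0 on success, -1 on failure with errno describing the error. */
static int hgm_map_pair(int hugetlb_fd, size_t len, struct hgm_mappings *out)
{
	int saved_errno;
	void *primary, *alias;

	/* Step 1: primary mapping. */
	primary = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED,
		       hugetlb_fd, 0);
	if (primary == MAP_FAILED)
		return -1;

	/* Step 2: alias mapping of the same file. */
	alias = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED,
		     hugetlb_fd, 0);
	if (alias == MAP_FAILED)
		goto err_unmap_primary;

	/* Step 3: opt the primary mapping in to the high-granularity APIs. */
	if (madvise(primary, len, MADV_SPLIT) != 0)
		goto err_unmap_alias;

	out->primary = primary;
	out->alias = alias;
	return 0;

err_unmap_alias:
	saved_errno = errno;
	munmap(alias, len);
	errno = saved_errno;
err_unmap_primary:
	saved_errno = errno;
	munmap(primary, len);
	errno = saved_errno;
	return -1;
}
```

Steps 4 and 5 would then register `primary` with a userfaultfd in minor-fault mode, copy incoming data through `alias`, and issue UFFDIO_CONTINUE on the matching PAGE_SIZE ranges of `primary`.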