Message ID: 1585451295-22302-1-git-send-email-lixinhai.lxh@gmail.com (mailing list archive)
State: New, archived
Series: mm: allow checking length for hugetlb mapping in mmap()
On 3/28/20 8:08 PM, Li Xinhai wrote:
> In the current code, all VMA-related calls on a hugetlb mapping except
> mmap() treat a not correctly aligned length as an invalid parameter,
> including mprotect(), munmap(), mlock(), etc., via the check in
> hugetlb_vm_op_split(). So the user will see failures after a successful
> mmap() call, although passing the same length parameter to the other
> mapping syscalls.
>
> It is desirable for all hugetlb mapping calls to have consistent
> behavior, without mmap() as an exception (which rounds the length up to
> the underlying hugepage size). The current description in
> Documentation/admin-guide/mm/hugetlbpage.rst is:
> "
> Syscalls that operate on memory backed by hugetlb pages only have their
> lengths aligned to the native page size of the processor; they will
> normally fail with errno set to EINVAL or exclude hugetlb pages that
> extend beyond the length if not hugepage aligned. For example, munmap(2)
> will fail if memory is backed by a hugetlb page and the length is smaller
> than the hugepage size.
> "
> which expresses that consistent behavior.


Missing here is a description of what the patch actually does...


>
> Signed-off-by: Li Xinhai <lixinhai.lxh@gmail.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Mike Kravetz <mike.kravetz@oracle.com>
> Cc: John Hubbard <jhubbard@nvidia.com>
> ---
> changes:
> 0. patch which introduced a new flag for mmap()
>    The new flag should be avoided.
>    https://lore.kernel.org/linux-mm/1585313944-8627-1-git-send-email-lixinhai.lxh@gmail.com/
>
>  mm/mmap.c | 8 --------
>  1 file changed, 8 deletions(-)
>
> diff --git a/mm/mmap.c b/mm/mmap.c
> index d681a20..b2aa102 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -1560,20 +1560,12 @@ unsigned long ksys_mmap_pgoff(unsigned long addr, unsigned long len,
>  		file = fget(fd);
>  		if (!file)
>  			return -EBADF;
> -		if (is_file_hugepages(file))
> -			len = ALIGN(len, huge_page_size(hstate_file(file)));


...and it looks like this simply removes the forced alignment, without adding
an error case for non-aligned lengths. So now I'm not immediately sure what
happens if a non-aligned length is passed in.

I would have expected to see either error checking or an ALIGN call here, but
now both are gone, so I'm lost and confused. :)


thanks,
--
John Hubbard
NVIDIA
On 2020-03-29 at 11:53 John Hubbard wrote:
>On 3/28/20 8:08 PM, Li Xinhai wrote:
>> In the current code, all VMA-related calls on a hugetlb mapping except
>> mmap() treat a not correctly aligned length as an invalid parameter,
>> including mprotect(), munmap(), mlock(), etc., via the check in
>> hugetlb_vm_op_split(). So the user will see failures after a successful
>> mmap() call, although passing the same length parameter to the other
>> mapping syscalls.
>>
>> It is desirable for all hugetlb mapping calls to have consistent
>> behavior, without mmap() as an exception (which rounds the length up to
>> the underlying hugepage size). The current description in
>> Documentation/admin-guide/mm/hugetlbpage.rst is:
>> "
>> Syscalls that operate on memory backed by hugetlb pages only have their
>> lengths aligned to the native page size of the processor; they will
>> normally fail with errno set to EINVAL or exclude hugetlb pages that
>> extend beyond the length if not hugepage aligned. For example, munmap(2)
>> will fail if memory is backed by a hugetlb page and the length is smaller
>> than the hugepage size.
>> "
>> which expresses that consistent behavior.
>
>
>Missing here is a description of what the patch actually does...
>

Right, a further statement can be added, like:
"
After this patch, all hugetlb-mapping-related syscalls will only align the
length parameter to the native page size of the processor. For mmap(),
hugetlb_get_unmapped_area() will set errno to EINVAL if the length is not
aligned to the underlying hugepage size.
"

>>
>> Signed-off-by: Li Xinhai <lixinhai.lxh@gmail.com>
>> Cc: Andrew Morton <akpm@linux-foundation.org>
>> Cc: Mike Kravetz <mike.kravetz@oracle.com>
>> Cc: John Hubbard <jhubbard@nvidia.com>
>> ---
>> changes:
>> 0. patch which introduced a new flag for mmap()
>>    The new flag should be avoided.
>>    https://lore.kernel.org/linux-mm/1585313944-8627-1-git-send-email-lixinhai.lxh@gmail.com/
>>
>>  mm/mmap.c | 8 --------
>>  1 file changed, 8 deletions(-)
>>
>> diff --git a/mm/mmap.c b/mm/mmap.c
>> index d681a20..b2aa102 100644
>> --- a/mm/mmap.c
>> +++ b/mm/mmap.c
>> @@ -1560,20 +1560,12 @@ unsigned long ksys_mmap_pgoff(unsigned long addr, unsigned long len,
>>  		file = fget(fd);
>>  		if (!file)
>>  			return -EBADF;
>> -		if (is_file_hugepages(file))
>> -			len = ALIGN(len, huge_page_size(hstate_file(file)));
>
>
>...and it looks like this simply removes the forced alignment, without adding
>an error case for non-aligned lengths. So now I'm not immediately sure what
>happens if a non-aligned length is passed in.
>
>I would have expected to see either error checking or an ALIGN call here, but
>now both are gone, so I'm lost and confused. :)
>

After this patch, the alignment will only be to the "native page size of the
processor", as done in do_mmap(). Then, following the code path, the length
is checked further by hugetlb_get_unmapped_area() against the underlying
hugepage size.

>
>thanks,
>--
>John Hubbard
>NVIDIA
>
>>  		retval = -EINVAL;
>>  		if (unlikely(flags & MAP_HUGETLB && !is_file_hugepages(file)))
>>  			goto out_fput;
>>  	} else if (flags & MAP_HUGETLB) {
>>  		struct user_struct *user = NULL;
>> -		struct hstate *hs;
>>
>> -		hs = hstate_sizelog((flags >> MAP_HUGE_SHIFT) & MAP_HUGE_MASK);
>> -		if (!hs)
>> -			return -EINVAL;
>> -
>> -		len = ALIGN(len, huge_page_size(hs));
>>  		/*
>>  		 * VM_NORESERVE is used because the reservations will be
>>  		 * taken when vm_ops->mmap() is called
>>
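For reference, the check Li is describing lives in hugetlb_get_unmapped_area()
in fs/hugetlbfs/inode.c. A simplified excerpt (roughly the code as of v5.6;
only the length checks are shown here):

	static unsigned long
	hugetlb_get_unmapped_area(struct file *file, unsigned long addr,
			unsigned long len, unsigned long pgoff, unsigned long flags)
	{
		struct hstate *h = hstate_file(file);

		/* Reject lengths that are not a multiple of the huge page size. */
		if (len & ~huge_page_mask(h))
			return -EINVAL;
		if (len > TASK_SIZE)
			return -ENOMEM;
		/* ... find and return a suitably aligned unmapped area ... */
	}

With the ALIGN() calls removed from ksys_mmap_pgoff(), an unaligned length
would reach this check unrounded, so the mmap() would fail with EINVAL.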
On 3/29/20 1:09 AM, Li Xinhai wrote:
> On 2020-03-29 at 11:53 John Hubbard wrote:
>> On 3/28/20 8:08 PM, Li Xinhai wrote:
>>> In the current code, all VMA-related calls on a hugetlb mapping except
>>> mmap() treat a not correctly aligned length as an invalid parameter,
>>> including mprotect(), munmap(), mlock(), etc., via the check in
>>> hugetlb_vm_op_split(). So the user will see failures after a successful
>>> mmap() call, although passing the same length parameter to the other
>>> mapping syscalls.
>>>
>>> It is desirable for all hugetlb mapping calls to have consistent
>>> behavior, without mmap() as an exception (which rounds the length up to
>>> the underlying hugepage size). The current description in
>>> Documentation/admin-guide/mm/hugetlbpage.rst is:
>>> "
>>> Syscalls that operate on memory backed by hugetlb pages only have their
>>> lengths aligned to the native page size of the processor; they will
>>> normally fail with errno set to EINVAL or exclude hugetlb pages that
>>> extend beyond the length if not hugepage aligned. For example, munmap(2)
>>> will fail if memory is backed by a hugetlb page and the length is smaller
>>> than the hugepage size.
>>> "
>>> which expresses that consistent behavior.
>>
>>
>> Missing here is a description of what the patch actually does...
>>
>
> Right, a further statement can be added, like:
> "
> After this patch, all hugetlb-mapping-related syscalls will only align the
> length parameter to the native page size of the processor. For mmap(),
> hugetlb_get_unmapped_area() will set errno to EINVAL if the length is not
> aligned to the underlying hugepage size.
> "
>
>>>
>>> Signed-off-by: Li Xinhai <lixinhai.lxh@gmail.com>
>>> Cc: Andrew Morton <akpm@linux-foundation.org>
>>> Cc: Mike Kravetz <mike.kravetz@oracle.com>
>>> Cc: John Hubbard <jhubbard@nvidia.com>
>>> ---
>>> changes:
>>> 0. patch which introduced a new flag for mmap()
>>>    The new flag should be avoided.
>>>    https://lore.kernel.org/linux-mm/1585313944-8627-1-git-send-email-lixinhai.lxh@gmail.com/

It is not exactly clear in your commit message, but this change will cause
mmap() of hugetlb ranges to fail (-EINVAL) if the length is not a multiple of
the huge page size. The mmap man page says:

    Huge page (Huge TLB) mappings
        For mappings that employ huge pages, the requirements for the
        arguments of mmap() and munmap() differ somewhat from the
        requirements for mappings that use the native system page size.

        For mmap(), offset must be a multiple of the underlying huge page
        size. The system automatically aligns length to be a multiple of
        the underlying huge page size.

        For munmap(), addr and length must both be a multiple of the
        underlying huge page size.

So this change may cause application failures. The code you are removing was
added with commit af73e4d9506d. The commit message for that commit says:

    hugetlbfs: fix mmap failure in unaligned size request

    The current kernel returns -EINVAL unless a given mmap length is
    "almost" hugepage aligned. This is because in sys_mmap_pgoff() the
    given length is passed to vm_mmap_pgoff() as it is, without being
    aligned to the hugepage boundary.

    This is a regression introduced in commit 40716e29243d ("hugetlbfs: fix
    alignment of huge page requests"), where the alignment code is pushed
    into hugetlb_file_setup() and the variable len on the caller side is
    not changed.

The change in commit af73e4d9506d was added because causing mmap to return
-EINVAL if the length is not a multiple of the huge page size was considered
a regression. It would still be considered a regression today.

I understand that the behavior is not consistent. However, it is clearly
documented. I do not believe we can change the behavior of this code.

--
Mike Kravetz
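The asymmetry described in the man page is easy to reproduce. Here is a small
test program (hypothetical, not part of the patch; it assumes a 2 MB default
huge page size and that huge pages have been reserved, e.g. via
/proc/sys/vm/nr_hugepages):

	#include <stdio.h>
	#include <string.h>
	#include <errno.h>
	#include <sys/mman.h>

	int main(void)
	{
		size_t len = 3 * 1024 * 1024;	/* 3 MB: not a multiple of 2 MB */

		void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
		if (p == MAP_FAILED) {
			perror("mmap");	/* e.g. ENOMEM if no huge pages reserved */
			return 1;
		}
		/* mmap() succeeded: the kernel silently rounded len up to 4 MB. */

		if (munmap(p, len) == -1)	/* the same 3 MB length now fails */
			printf("munmap(3 MB): %s\n", strerror(errno));	/* EINVAL */

		return munmap(p, 4 * 1024 * 1024);	/* aligned length succeeds */
	}

mmap() accepts the unaligned length, but munmap() (and mprotect(), mlock(),
etc.) reject the very same value; this is the inconsistency the patch was
aiming at.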
On 2020-03-31 at 02:39 Mike Kravetz wrote:
>On 3/29/20 1:09 AM, Li Xinhai wrote:
>> On 2020-03-29 at 11:53 John Hubbard wrote:
>>> On 3/28/20 8:08 PM, Li Xinhai wrote:
>>>> In the current code, all VMA-related calls on a hugetlb mapping except
>>>> mmap() treat a not correctly aligned length as an invalid parameter,
>>>> including mprotect(), munmap(), mlock(), etc., via the check in
>>>> hugetlb_vm_op_split(). So the user will see failures after a successful
>>>> mmap() call, although passing the same length parameter to the other
>>>> mapping syscalls.
>>>>
>>>> It is desirable for all hugetlb mapping calls to have consistent
>>>> behavior, without mmap() as an exception (which rounds the length up to
>>>> the underlying hugepage size). The current description in
>>>> Documentation/admin-guide/mm/hugetlbpage.rst is:
>>>> "
>>>> Syscalls that operate on memory backed by hugetlb pages only have their
>>>> lengths aligned to the native page size of the processor; they will
>>>> normally fail with errno set to EINVAL or exclude hugetlb pages that
>>>> extend beyond the length if not hugepage aligned. For example, munmap(2)
>>>> will fail if memory is backed by a hugetlb page and the length is smaller
>>>> than the hugepage size.
>>>> "
>>>> which expresses that consistent behavior.
>>>
>>>
>>> Missing here is a description of what the patch actually does...
>>>
>>
>> Right, a further statement can be added, like:
>> "
>> After this patch, all hugetlb-mapping-related syscalls will only align the
>> length parameter to the native page size of the processor. For mmap(),
>> hugetlb_get_unmapped_area() will set errno to EINVAL if the length is not
>> aligned to the underlying hugepage size.
>> "
>>
>>>>
>>>> Signed-off-by: Li Xinhai <lixinhai.lxh@gmail.com>
>>>> Cc: Andrew Morton <akpm@linux-foundation.org>
>>>> Cc: Mike Kravetz <mike.kravetz@oracle.com>
>>>> Cc: John Hubbard <jhubbard@nvidia.com>
>>>> ---
>>>> changes:
>>>> 0. patch which introduced a new flag for mmap()
>>>>    The new flag should be avoided.
>>>>    https://lore.kernel.org/linux-mm/1585313944-8627-1-git-send-email-lixinhai.lxh@gmail.com/
>
>It is not exactly clear in your commit message, but this change will cause
>mmap() of hugetlb ranges to fail (-EINVAL) if the length is not a multiple of
>the huge page size. The mmap man page says:
>
>    Huge page (Huge TLB) mappings
>        For mappings that employ huge pages, the requirements for the
>        arguments of mmap() and munmap() differ somewhat from the
>        requirements for mappings that use the native system page size.
>
>        For mmap(), offset must be a multiple of the underlying huge page
>        size. The system automatically aligns length to be a multiple of
>        the underlying huge page size.
>
>        For munmap(), addr and length must both be a multiple of the
>        underlying huge page size.
>
>So this change may cause application failures. The code you are removing was
>added with commit af73e4d9506d. The commit message for that commit says:
>
>    hugetlbfs: fix mmap failure in unaligned size request
>
>    The current kernel returns -EINVAL unless a given mmap length is
>    "almost" hugepage aligned. This is because in sys_mmap_pgoff() the
>    given length is passed to vm_mmap_pgoff() as it is, without being
>    aligned to the hugepage boundary.
>
>    This is a regression introduced in commit 40716e29243d ("hugetlbfs: fix
>    alignment of huge page requests"), where the alignment code is pushed
>    into hugetlb_file_setup() and the variable len on the caller side is
>    not changed.
>
>The change in commit af73e4d9506d was added because causing mmap to return
>-EINVAL if the length is not a multiple of the huge page size was considered
>a regression. It would still be considered a regression today.
>

Agreed, it would cause a regression today if those user space applications
still work in that way. After reading through the bug report page, it is
indeed the case that some applications want to use a non-aligned size for
mmap(), but don't care what happens if that size is used in subsequent calls.

My understanding may be wrong, but it seems that once applications start to
rely on some kernel behavior, even if that usage in user space is not
logical, they will be protected from changes on the kernel side.

>I understand that the behavior is not consistent. However, it is clearly
>documented. I do not believe we can change the behavior of this code.
>
>--
>Mike Kravetz
On 3/31/20 1:35 AM, Li Xinhai wrote:
> My understanding may be wrong, but it seems that once applications start to
> rely on some kernel behavior, even if that usage in user space is not
> logical, they will be protected from changes on the kernel side.

Correct. I too wish that the length argument to mmap for hugetlb mappings
was required to be a multiple of the huge page size. That would make
everything nice and consistent. However, the behavior of rounding the length
up to the huge page size has existed for quite some time and is well
documented. Therefore, we cannot change it without the possibility of
breaking some application.
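Given that conclusion, an application that wants consistent lengths across
all the hugetlb syscalls can simply do the rounding itself. A minimal sketch
(the helper name is illustrative; 2 MB is assumed as the huge page size):

	#include <sys/mman.h>

	/* Round len up to the next multiple of the huge page size, so that
	 * mmap(), munmap(), mprotect(), mlock(), etc. all see the same,
	 * already-aligned value. Assumes hpsize is a power of two. */
	static size_t hugetlb_align(size_t len, size_t hpsize)
	{
		return (len + hpsize - 1) & ~(hpsize - 1);
	}

	/* usage:
	 *   size_t len = hugetlb_align(request, 2UL << 20);
	 *   void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
	 *                  MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
	 *   ...
	 *   munmap(p, len);   // the aligned length is valid everywhere
	 */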
diff --git a/mm/mmap.c b/mm/mmap.c
index d681a20..b2aa102 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1560,20 +1560,12 @@ unsigned long ksys_mmap_pgoff(unsigned long addr, unsigned long len,
 		file = fget(fd);
 		if (!file)
 			return -EBADF;
-		if (is_file_hugepages(file))
-			len = ALIGN(len, huge_page_size(hstate_file(file)));
 		retval = -EINVAL;
 		if (unlikely(flags & MAP_HUGETLB && !is_file_hugepages(file)))
 			goto out_fput;
 	} else if (flags & MAP_HUGETLB) {
 		struct user_struct *user = NULL;
-		struct hstate *hs;
 
-		hs = hstate_sizelog((flags >> MAP_HUGE_SHIFT) & MAP_HUGE_MASK);
-		if (!hs)
-			return -EINVAL;
-
-		len = ALIGN(len, huge_page_size(hs));
 		/*
 		 * VM_NORESERVE is used because the reservations will be
 		 * taken when vm_ops->mmap() is called
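The hstate_sizelog((flags >> MAP_HUGE_SHIFT) & MAP_HUGE_MASK) call removed
above decodes the huge page size that userspace may encode into the mmap()
flags. For context, a caller selects a 2 MB page size like this (a sketch;
MAP_HUGE_2MB is simply 21, the log2 of 2 MB, shifted left by MAP_HUGE_SHIFT):

	#include <sys/mman.h>
	#include <linux/mman.h>	/* MAP_HUGE_SHIFT, MAP_HUGE_2MB */

	/* Map an anonymous hugetlb region backed by 2 MB pages. */
	void *map_huge_2mb(size_t len)
	{
		return mmap(NULL, len, PROT_READ | PROT_WRITE,
			    MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB |
			    MAP_HUGE_2MB, -1, 0);
	}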
In the current code, all VMA-related calls on a hugetlb mapping except
mmap() treat a not correctly aligned length as an invalid parameter,
including mprotect(), munmap(), mlock(), etc., via the check in
hugetlb_vm_op_split(). So the user will see failures after a successful
mmap() call, although passing the same length parameter to the other
mapping syscalls.

It is desirable for all hugetlb mapping calls to have consistent behavior,
without mmap() as an exception (which rounds the length up to the underlying
hugepage size). The current description in
Documentation/admin-guide/mm/hugetlbpage.rst is:
"
Syscalls that operate on memory backed by hugetlb pages only have their
lengths aligned to the native page size of the processor; they will
normally fail with errno set to EINVAL or exclude hugetlb pages that
extend beyond the length if not hugepage aligned. For example, munmap(2)
will fail if memory is backed by a hugetlb page and the length is smaller
than the hugepage size.
"
which expresses that consistent behavior.

Signed-off-by: Li Xinhai <lixinhai.lxh@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: John Hubbard <jhubbard@nvidia.com>
---
changes:
0. patch which introduced a new flag for mmap()
   The new flag should be avoided.
   https://lore.kernel.org/linux-mm/1585313944-8627-1-git-send-email-lixinhai.lxh@gmail.com/

 mm/mmap.c | 8 --------
 1 file changed, 8 deletions(-)