Message ID | 3b92d56f55671a0389252379237703df6e86ea48.1682464032.git.lstoakes@gmail.com (mailing list archive) |
---|---|
State | Mainlined, archived |
Headers | show |
Series | [v4] mm/gup: disallow GUP writing to file-backed mappings by default | expand |
Hi, On 26.4.2023 2.15, Lorenzo Stoakes wrote: > GUP does not correctly implement write-notify semantics, nor does it > guarantee that the underlying pages are correctly dirtied, which could lead > to a kernel oops or data corruption when writing to file-backed mappings. > > This is only relevant when the mappings are file-backed and the underlying > file system requires folio dirty tracking. File systems which do not, such > as shmem or hugetlb, are not at risk and therefore can be written to > without issue. > > Unfortunately this limitation of GUP has been present for some time and > requires future rework of the GUP API in order to provide correct write > access to such mappings. > > In the meantime, we add a check for the most broken GUP case - > FOLL_LONGTERM - which really under no circumstances can safely access > dirty-tracked file mappings. > > As part of this change we separate out vma_needs_dirty_tracking() as a > helper function to determine this, which is distinct from > vma_wants_writenotify() which is specific to determining which PTE flags to > set. > > Suggested-by: Jason Gunthorpe <jgg@nvidia.com> > Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com> > --- > v4: > - Split out vma_needs_dirty_tracking() from vma_wants_writenotify() to reduce > duplication and update to use this in the GUP check. Note that both separately > check vm_ops_needs_writenotify() as the latter needs to test this before the > vm_pgprot_modify() test, resulting in vma_wants_writenotify() checking this > twice, however it is such a small check this should not be egregious. > > v3: > - Rebased on latest mm-unstable as of 24th April 2023. > - Explicitly check whether file system requires folio dirtying. Note that > vma_wants_writenotify() could not be used directly as it is very much focused > on determining if the PTE r/w should be set (e.g. assuming private mapping > does not require it as already set, soft dirty considerations). > - Tested code against shmem and hugetlb mappings - confirmed that these are not > disallowed by the check. > - Eliminate FOLL_ALLOW_BROKEN_FILE_MAPPING flag and instead perform check only > for FOLL_LONGTERM pins. > - As a result, limit check to internal GUP code. > https://lore.kernel.org/all/23c19e27ef0745f6d3125976e047ee0da62569d4.1682406295.git.lstoakes@gmail.com/ > > v2: > - Add accidentally excluded ptrace_access_vm() use of > FOLL_ALLOW_BROKEN_FILE_MAPPING. > - Tweak commit message. > https://lore.kernel.org/all/c8ee7e02d3d4f50bb3e40855c53bda39eec85b7d.1682321768.git.lstoakes@gmail.com/ > > v1: > https://lore.kernel.org/all/f86dc089b460c80805e321747b0898fd1efe93d7.1682168199.git.lstoakes@gmail.com/ > > include/linux/mm.h | 1 + > mm/gup.c | 26 +++++++++++++++++++++++++- > mm/mmap.c | 37 ++++++++++++++++++++++++++++--------- > 3 files changed, 54 insertions(+), 10 deletions(-) > > diff --git a/include/linux/mm.h b/include/linux/mm.h > index 37554b08bb28..f7da02fc89c6 100644 > --- a/include/linux/mm.h > +++ b/include/linux/mm.h > @@ -2433,6 +2433,7 @@ extern unsigned long move_page_tables(struct vm_area_struct *vma, > #define MM_CP_UFFD_WP_ALL (MM_CP_UFFD_WP | \ > MM_CP_UFFD_WP_RESOLVE) > > +bool vma_needs_dirty_tracking(struct vm_area_struct *vma); > int vma_wants_writenotify(struct vm_area_struct *vma, pgprot_t vm_page_prot); > static inline bool vma_wants_manual_pte_write_upgrade(struct vm_area_struct *vma) > { > diff --git a/mm/gup.c b/mm/gup.c > index 1f72a717232b..53652453037c 100644 > --- a/mm/gup.c > +++ b/mm/gup.c > @@ -959,16 +959,37 @@ static int faultin_page(struct vm_area_struct *vma, > return 0; > } > > +/* > + * Writing to file-backed mappings which require folio dirty tracking using GUP > + * is a fundamentally broken operation as kernel write access to GUP mappings > + * may not adhere to the semantics expected by a file system. > + */ > +static inline bool can_write_file_mapping(struct vm_area_struct *vma, > + unsigned long gup_flags) > +{ > + /* If we aren't pinning then no problematic write can occur. */ > + if (!(gup_flags & (FOLL_GET | FOLL_PIN))) > + return true; > + > + /* We limit this check to the most egregious case - a long term pin. */ > + if (!(gup_flags & FOLL_LONGTERM)) > + return true; > + > + /* If the VMA requires dirty tracking then GUP will be problematic. */ > + return vma_needs_dirty_tracking(vma); > +} > + > static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags) > { > vm_flags_t vm_flags = vma->vm_flags; > int write = (gup_flags & FOLL_WRITE); > int foreign = (gup_flags & FOLL_REMOTE); > + bool vma_anon = vma_is_anonymous(vma); > > if (vm_flags & (VM_IO | VM_PFNMAP)) > return -EFAULT; > > - if (gup_flags & FOLL_ANON && !vma_is_anonymous(vma)) > + if ((gup_flags & FOLL_ANON) && !vma_anon) > return -EFAULT; > > if ((gup_flags & FOLL_LONGTERM) && vma_is_fsdax(vma)) > @@ -978,6 +999,9 @@ static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags) > return -EFAULT; > > if (write) { > + if (!vma_anon && !can_write_file_mapping(vma, gup_flags)) > + return -EFAULT; > + > if (!(vm_flags & VM_WRITE)) { > if (!(gup_flags & FOLL_FORCE)) > return -EFAULT; > diff --git a/mm/mmap.c b/mm/mmap.c > index 536bbb8fa0ae..aac638dd22cf 100644 > --- a/mm/mmap.c > +++ b/mm/mmap.c > @@ -1475,6 +1475,32 @@ SYSCALL_DEFINE1(old_mmap, struct mmap_arg_struct __user *, arg) > } > #endif /* __ARCH_WANT_SYS_OLD_MMAP */ > > +/* Do VMA operations imply write notify is required? */ > +static inline bool vm_ops_needs_writenotify( > + const struct vm_operations_struct *vm_ops) > +{ > + return vm_ops && (vm_ops->page_mkwrite || vm_ops->pfn_mkwrite); > +} > + > +/* > + * Does this VMA require the underlying folios to have their dirty state > + * tracked? > + */ > +bool vma_needs_dirty_tracking(struct vm_area_struct *vma) > +{ > + /* Does the filesystem need to be notified? */ > + if (vm_ops_needs_writenotify(vma->vm_ops)) > + return true; > + > + /* Specialty mapping? */ > + if (vma->vm_flags & VM_PFNMAP) > + return false; > + > + /* Can the mapping track the dirty pages? */ > + return vma->vm_file && vma->vm_file->f_mapping && > + mapping_can_writeback(vma->vm_file->f_mapping); > +} > + What would be the exact reproducer of the problem? AFAIK writenotify is handled (by handle_mm_fault()) for non cow mappings (shared), where it only matters. GUP will only allow FOLL_FORCE without faulting for PageAnonExclusive pages. So if you want something beyond normal cow semantics you have custom vm_ops (and mmap() and fault()) Also for longterm pinning gups vs fork vs swap there has been fixes by david recently. > /* > * Some shared mappings will want the pages marked read-only > * to track write events. If so, we'll downgrade vm_page_prot > @@ -1484,14 +1510,13 @@ SYSCALL_DEFINE1(old_mmap, struct mmap_arg_struct __user *, arg) > int vma_wants_writenotify(struct vm_area_struct *vma, pgprot_t vm_page_prot) > { > vm_flags_t vm_flags = vma->vm_flags; > - const struct vm_operations_struct *vm_ops = vma->vm_ops; > > /* If it was private or non-writable, the write bit is already clear */ > if ((vm_flags & (VM_WRITE|VM_SHARED)) != ((VM_WRITE|VM_SHARED))) > return 0; > > /* The backer wishes to know when pages are first written to? */ > - if (vm_ops && (vm_ops->page_mkwrite || vm_ops->pfn_mkwrite)) > + if (vm_ops_needs_writenotify(vma->vm_ops)) > return 1; > > /* The open routine did something to the protections that pgprot_modify > @@ -1511,13 +1536,7 @@ int vma_wants_writenotify(struct vm_area_struct *vma, pgprot_t vm_page_prot) > if (userfaultfd_wp(vma)) > return 1; > > - /* Specialty mapping? */ > - if (vm_flags & VM_PFNMAP) > - return 0; > - > - /* Can the mapping track the dirty pages? */ > - return vma->vm_file && vma->vm_file->f_mapping && > - mapping_can_writeback(vma->vm_file->f_mapping); > + return vma_needs_dirty_tracking(vma); > } > > /* --Mika
On Wed, Apr 26, 2023 at 06:18:38AM +0300, Mika Penttilä wrote: > Hi, > > > On 26.4.2023 2.15, Lorenzo Stoakes wrote: > > GUP does not correctly implement write-notify semantics, nor does it > > guarantee that the underlying pages are correctly dirtied, which could lead > > to a kernel oops or data corruption when writing to file-backed mappings. > > > > This is only relevant when the mappings are file-backed and the underlying > > file system requires folio dirty tracking. File systems which do not, such > > as shmem or hugetlb, are not at risk and therefore can be written to > > without issue. > > > > Unfortunately this limitation of GUP has been present for some time and > > requires future rework of the GUP API in order to provide correct write > > access to such mappings. > > > > In the meantime, we add a check for the most broken GUP case - > > FOLL_LONGTERM - which really under no circumstances can safely access > > dirty-tracked file mappings. > > > > As part of this change we separate out vma_needs_dirty_tracking() as a > > helper function to determine this, which is distinct from > > vma_wants_writenotify() which is specific to determining which PTE flags to > > set. > > > > Suggested-by: Jason Gunthorpe <jgg@nvidia.com> > > Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com> > > --- > > v4: > > - Split out vma_needs_dirty_tracking() from vma_wants_writenotify() to reduce > > duplication and update to use this in the GUP check. Note that both separately > > check vm_ops_needs_writenotify() as the latter needs to test this before the > > vm_pgprot_modify() test, resulting in vma_wants_writenotify() checking this > > twice, however it is such a small check this should not be egregious. > > > > v3: > > - Rebased on latest mm-unstable as of 24th April 2023. > > - Explicitly check whether file system requires folio dirtying. Note that > > vma_wants_writenotify() could not be used directly as it is very much focused > > on determining if the PTE r/w should be set (e.g. assuming private mapping > > does not require it as already set, soft dirty considerations). > > - Tested code against shmem and hugetlb mappings - confirmed that these are not > > disallowed by the check. > > - Eliminate FOLL_ALLOW_BROKEN_FILE_MAPPING flag and instead perform check only > > for FOLL_LONGTERM pins. > > - As a result, limit check to internal GUP code. > > https://lore.kernel.org/all/23c19e27ef0745f6d3125976e047ee0da62569d4.1682406295.git.lstoakes@gmail.com/ > > > > v2: > > - Add accidentally excluded ptrace_access_vm() use of > > FOLL_ALLOW_BROKEN_FILE_MAPPING. > > - Tweak commit message. > > https://lore.kernel.org/all/c8ee7e02d3d4f50bb3e40855c53bda39eec85b7d.1682321768.git.lstoakes@gmail.com/ > > > > v1: > > https://lore.kernel.org/all/f86dc089b460c80805e321747b0898fd1efe93d7.1682168199.git.lstoakes@gmail.com/ > > > > include/linux/mm.h | 1 + > > mm/gup.c | 26 +++++++++++++++++++++++++- > > mm/mmap.c | 37 ++++++++++++++++++++++++++++--------- > > 3 files changed, 54 insertions(+), 10 deletions(-) > > > > diff --git a/include/linux/mm.h b/include/linux/mm.h > > index 37554b08bb28..f7da02fc89c6 100644 > > --- a/include/linux/mm.h > > +++ b/include/linux/mm.h > > @@ -2433,6 +2433,7 @@ extern unsigned long move_page_tables(struct vm_area_struct *vma, > > #define MM_CP_UFFD_WP_ALL (MM_CP_UFFD_WP | \ > > MM_CP_UFFD_WP_RESOLVE) > > +bool vma_needs_dirty_tracking(struct vm_area_struct *vma); > > int vma_wants_writenotify(struct vm_area_struct *vma, pgprot_t vm_page_prot); > > static inline bool vma_wants_manual_pte_write_upgrade(struct vm_area_struct *vma) > > { > > diff --git a/mm/gup.c b/mm/gup.c > > index 1f72a717232b..53652453037c 100644 > > --- a/mm/gup.c > > +++ b/mm/gup.c > > @@ -959,16 +959,37 @@ static int faultin_page(struct vm_area_struct *vma, > > return 0; > > } > > +/* > > + * Writing to file-backed mappings which require folio dirty tracking using GUP > > + * is a fundamentally broken operation as kernel write access to GUP mappings > > + * may not adhere to the semantics expected by a file system. > > + */ > > +static inline bool can_write_file_mapping(struct vm_area_struct *vma, > > + unsigned long gup_flags) > > +{ > > + /* If we aren't pinning then no problematic write can occur. */ > > + if (!(gup_flags & (FOLL_GET | FOLL_PIN))) > > + return true; > > + > > + /* We limit this check to the most egregious case - a long term pin. */ > > + if (!(gup_flags & FOLL_LONGTERM)) > > + return true; > > + > > + /* If the VMA requires dirty tracking then GUP will be problematic. */ > > + return vma_needs_dirty_tracking(vma); > > +} > > + > > static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags) > > { > > vm_flags_t vm_flags = vma->vm_flags; > > int write = (gup_flags & FOLL_WRITE); > > int foreign = (gup_flags & FOLL_REMOTE); > > + bool vma_anon = vma_is_anonymous(vma); > > if (vm_flags & (VM_IO | VM_PFNMAP)) > > return -EFAULT; > > - if (gup_flags & FOLL_ANON && !vma_is_anonymous(vma)) > > + if ((gup_flags & FOLL_ANON) && !vma_anon) > > return -EFAULT; > > if ((gup_flags & FOLL_LONGTERM) && vma_is_fsdax(vma)) > > @@ -978,6 +999,9 @@ static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags) > > return -EFAULT; > > if (write) { > > + if (!vma_anon && !can_write_file_mapping(vma, gup_flags)) > > + return -EFAULT; > > + > > if (!(vm_flags & VM_WRITE)) { > > if (!(gup_flags & FOLL_FORCE)) > > return -EFAULT; > > diff --git a/mm/mmap.c b/mm/mmap.c > > index 536bbb8fa0ae..aac638dd22cf 100644 > > --- a/mm/mmap.c > > +++ b/mm/mmap.c > > @@ -1475,6 +1475,32 @@ SYSCALL_DEFINE1(old_mmap, struct mmap_arg_struct __user *, arg) > > } > > #endif /* __ARCH_WANT_SYS_OLD_MMAP */ > > +/* Do VMA operations imply write notify is required? */ > > +static inline bool vm_ops_needs_writenotify( > > + const struct vm_operations_struct *vm_ops) > > +{ > > + return vm_ops && (vm_ops->page_mkwrite || vm_ops->pfn_mkwrite); > > +} > > + > > +/* > > + * Does this VMA require the underlying folios to have their dirty state > > + * tracked? > > + */ > > +bool vma_needs_dirty_tracking(struct vm_area_struct *vma) > > +{ > > + /* Does the filesystem need to be notified? */ > > + if (vm_ops_needs_writenotify(vma->vm_ops)) > > + return true; > > + > > + /* Specialty mapping? */ > > + if (vma->vm_flags & VM_PFNMAP) > > + return false; > > + > > + /* Can the mapping track the dirty pages? */ > > + return vma->vm_file && vma->vm_file->f_mapping && > > + mapping_can_writeback(vma->vm_file->f_mapping); > > +} > > + > > What would be the exact reproducer of the problem? AFAIK writenotify is > handled (by handle_mm_fault()) for non cow mappings (shared), where it only > matters. The issue is reproduced simply by page_to_virt(pinned_page)[0] = 'x' :) The problem is that no faulting actually occurs, so no writenotify, and no PG_dirty tracking does either. Unexpected page dirtying can occur even after they are cleaned in folio_clear_dirty_for_io(), because the caller might manually mark the page dirty at an unexpected time as with the unpin_*dirty*() helpers. I think the long-term solution is to provide a different interface where pages are passed back briefly with locks held and with a manual invocation of writeprotect, or perhaps some kthread_use_mm() thing so we actually trigger the faulting logic, but in the meantime this change helps restore some sanity. > > GUP will only allow FOLL_FORCE without faulting for PageAnonExclusive pages. > So if you want something beyond normal cow semantics you have custom vm_ops > (and mmap() and fault()) This has nothing to do with FOLL_FORCE. > > Also for longterm pinning gups vs fork vs swap there has been fixes by david > recently. I don't think these are relevant in any way to this issue. > > > > > /* > > * Some shared mappings will want the pages marked read-only > > * to track write events. If so, we'll downgrade vm_page_prot > > @@ -1484,14 +1510,13 @@ SYSCALL_DEFINE1(old_mmap, struct mmap_arg_struct __user *, arg) > > int vma_wants_writenotify(struct vm_area_struct *vma, pgprot_t vm_page_prot) > > { > > vm_flags_t vm_flags = vma->vm_flags; > > - const struct vm_operations_struct *vm_ops = vma->vm_ops; > > /* If it was private or non-writable, the write bit is already clear */ > > if ((vm_flags & (VM_WRITE|VM_SHARED)) != ((VM_WRITE|VM_SHARED))) > > return 0; > > /* The backer wishes to know when pages are first written to? */ > > - if (vm_ops && (vm_ops->page_mkwrite || vm_ops->pfn_mkwrite)) > > + if (vm_ops_needs_writenotify(vma->vm_ops)) > > return 1; > > /* The open routine did something to the protections that pgprot_modify > > @@ -1511,13 +1536,7 @@ int vma_wants_writenotify(struct vm_area_struct *vma, pgprot_t vm_page_prot) > > if (userfaultfd_wp(vma)) > > return 1; > > - /* Specialty mapping? */ > > - if (vm_flags & VM_PFNMAP) > > - return 0; > > - > > - /* Can the mapping track the dirty pages? */ > > - return vma->vm_file && vma->vm_file->f_mapping && > > - mapping_can_writeback(vma->vm_file->f_mapping); > > + return vma_needs_dirty_tracking(vma); > > } > > /* > > > --Mika >
On 26.4.2023 10.00, Lorenzo Stoakes wrote: > On Wed, Apr 26, 2023 at 06:18:38AM +0300, Mika Penttilä wrote: >> Hi, >> >> >> On 26.4.2023 2.15, Lorenzo Stoakes wrote: >>> GUP does not correctly implement write-notify semantics, nor does it >>> guarantee that the underlying pages are correctly dirtied, which could lead >>> to a kernel oops or data corruption when writing to file-backed mappings. >>> >>> This is only relevant when the mappings are file-backed and the underlying >>> file system requires folio dirty tracking. File systems which do not, such >>> as shmem or hugetlb, are not at risk and therefore can be written to >>> without issue. >>> >>> Unfortunately this limitation of GUP has been present for some time and >>> requires future rework of the GUP API in order to provide correct write >>> access to such mappings. >>> >>> In the meantime, we add a check for the most broken GUP case - >>> FOLL_LONGTERM - which really under no circumstances can safely access >>> dirty-tracked file mappings. >>> >>> As part of this change we separate out vma_needs_dirty_tracking() as a >>> helper function to determine this, which is distinct from >>> vma_wants_writenotify() which is specific to determining which PTE flags to >>> set. >>> >>> Suggested-by: Jason Gunthorpe <jgg@nvidia.com> >>> Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com> >>> --- >>> v4: >>> - Split out vma_needs_dirty_tracking() from vma_wants_writenotify() to reduce >>> duplication and update to use this in the GUP check. Note that both separately >>> check vm_ops_needs_writenotify() as the latter needs to test this before the >>> vm_pgprot_modify() test, resulting in vma_wants_writenotify() checking this >>> twice, however it is such a small check this should not be egregious. >>> >>> v3: >>> - Rebased on latest mm-unstable as of 24th April 2023. >>> - Explicitly check whether file system requires folio dirtying. Note that >>> vma_wants_writenotify() could not be used directly as it is very much focused >>> on determining if the PTE r/w should be set (e.g. assuming private mapping >>> does not require it as already set, soft dirty considerations). >>> - Tested code against shmem and hugetlb mappings - confirmed that these are not >>> disallowed by the check. >>> - Eliminate FOLL_ALLOW_BROKEN_FILE_MAPPING flag and instead perform check only >>> for FOLL_LONGTERM pins. >>> - As a result, limit check to internal GUP code. >>> https://lore.kernel.org/all/23c19e27ef0745f6d3125976e047ee0da62569d4.1682406295.git.lstoakes@gmail.com/ >>> >>> v2: >>> - Add accidentally excluded ptrace_access_vm() use of >>> FOLL_ALLOW_BROKEN_FILE_MAPPING. >>> - Tweak commit message. >>> https://lore.kernel.org/all/c8ee7e02d3d4f50bb3e40855c53bda39eec85b7d.1682321768.git.lstoakes@gmail.com/ >>> >>> v1: >>> https://lore.kernel.org/all/f86dc089b460c80805e321747b0898fd1efe93d7.1682168199.git.lstoakes@gmail.com/ >>> >>> include/linux/mm.h | 1 + >>> mm/gup.c | 26 +++++++++++++++++++++++++- >>> mm/mmap.c | 37 ++++++++++++++++++++++++++++--------- >>> 3 files changed, 54 insertions(+), 10 deletions(-) >>> >>> diff --git a/include/linux/mm.h b/include/linux/mm.h >>> index 37554b08bb28..f7da02fc89c6 100644 >>> --- a/include/linux/mm.h >>> +++ b/include/linux/mm.h >>> @@ -2433,6 +2433,7 @@ extern unsigned long move_page_tables(struct vm_area_struct *vma, >>> #define MM_CP_UFFD_WP_ALL (MM_CP_UFFD_WP | \ >>> MM_CP_UFFD_WP_RESOLVE) >>> +bool vma_needs_dirty_tracking(struct vm_area_struct *vma); >>> int vma_wants_writenotify(struct vm_area_struct *vma, pgprot_t vm_page_prot); >>> static inline bool vma_wants_manual_pte_write_upgrade(struct vm_area_struct *vma) >>> { >>> diff --git a/mm/gup.c b/mm/gup.c >>> index 1f72a717232b..53652453037c 100644 >>> --- a/mm/gup.c >>> +++ b/mm/gup.c >>> @@ -959,16 +959,37 @@ static int faultin_page(struct vm_area_struct *vma, >>> return 0; >>> } >>> +/* >>> + * Writing to file-backed mappings which require folio dirty tracking using GUP >>> + * is a fundamentally broken operation as kernel write access to GUP mappings >>> + * may not adhere to the semantics expected by a file system. >>> + */ >>> +static inline bool can_write_file_mapping(struct vm_area_struct *vma, >>> + unsigned long gup_flags) >>> +{ >>> + /* If we aren't pinning then no problematic write can occur. */ >>> + if (!(gup_flags & (FOLL_GET | FOLL_PIN))) >>> + return true; >>> + >>> + /* We limit this check to the most egregious case - a long term pin. */ >>> + if (!(gup_flags & FOLL_LONGTERM)) >>> + return true; >>> + >>> + /* If the VMA requires dirty tracking then GUP will be problematic. */ >>> + return vma_needs_dirty_tracking(vma); >>> +} >>> + >>> static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags) >>> { >>> vm_flags_t vm_flags = vma->vm_flags; >>> int write = (gup_flags & FOLL_WRITE); >>> int foreign = (gup_flags & FOLL_REMOTE); >>> + bool vma_anon = vma_is_anonymous(vma); >>> if (vm_flags & (VM_IO | VM_PFNMAP)) >>> return -EFAULT; >>> - if (gup_flags & FOLL_ANON && !vma_is_anonymous(vma)) >>> + if ((gup_flags & FOLL_ANON) && !vma_anon) >>> return -EFAULT; >>> if ((gup_flags & FOLL_LONGTERM) && vma_is_fsdax(vma)) >>> @@ -978,6 +999,9 @@ static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags) >>> return -EFAULT; >>> if (write) { >>> + if (!vma_anon && !can_write_file_mapping(vma, gup_flags)) >>> + return -EFAULT; >>> + >>> if (!(vm_flags & VM_WRITE)) { >>> if (!(gup_flags & FOLL_FORCE)) >>> return -EFAULT; >>> diff --git a/mm/mmap.c b/mm/mmap.c >>> index 536bbb8fa0ae..aac638dd22cf 100644 >>> --- a/mm/mmap.c >>> +++ b/mm/mmap.c >>> @@ -1475,6 +1475,32 @@ SYSCALL_DEFINE1(old_mmap, struct mmap_arg_struct __user *, arg) >>> } >>> #endif /* __ARCH_WANT_SYS_OLD_MMAP */ >>> +/* Do VMA operations imply write notify is required? */ >>> +static inline bool vm_ops_needs_writenotify( >>> + const struct vm_operations_struct *vm_ops) >>> +{ >>> + return vm_ops && (vm_ops->page_mkwrite || vm_ops->pfn_mkwrite); >>> +} >>> + >>> +/* >>> + * Does this VMA require the underlying folios to have their dirty state >>> + * tracked? >>> + */ >>> +bool vma_needs_dirty_tracking(struct vm_area_struct *vma) >>> +{ >>> + /* Does the filesystem need to be notified? */ >>> + if (vm_ops_needs_writenotify(vma->vm_ops)) >>> + return true; >>> + >>> + /* Specialty mapping? */ >>> + if (vma->vm_flags & VM_PFNMAP) >>> + return false; >>> + >>> + /* Can the mapping track the dirty pages? */ >>> + return vma->vm_file && vma->vm_file->f_mapping && >>> + mapping_can_writeback(vma->vm_file->f_mapping); >>> +} >>> + >> >> What would be the exact reproducer of the problem? AFAIK writenotify is >> handled (by handle_mm_fault()) for non cow mappings (shared), where it only >> matters. > > The issue is reproduced simply by page_to_virt(pinned_page)[0] = 'x' :) > > The problem is that no faulting actually occurs, so no writenotify, and no Could you elaborate? GUP calls handle_mm_fault() that invokes the write notify the pte is made first writable. Of course, virt(pinned_page)[0] = 'x' is not supposed to fault while using the kernel mapping. > PG_dirty tracking does either. Unexpected page dirtying can occur even > after they are cleaned in folio_clear_dirty_for_io(), because the caller > might manually mark the page dirty at an unexpected time as with the > unpin_*dirty*() helpers. > > I think the long-term solution is to provide a different interface where > pages are passed back briefly with locks held and with a manual invocation > of writeprotect, or perhaps some kthread_use_mm() thing so we actually > trigger the faulting logic, but in the meantime this change helps restore > some sanity. > >> >> GUP will only allow FOLL_FORCE without faulting for PageAnonExclusive pages. >> So if you want something beyond normal cow semantics you have custom vm_ops >> (and mmap() and fault()) > > This has nothing to do with FOLL_FORCE. > >> >> Also for longterm pinning gups vs fork vs swap there has been fixes by david >> recently. > > I don't think these are relevant in any way to this issue. > >> >> >> >>> /* >>> * Some shared mappings will want the pages marked read-only >>> * to track write events. If so, we'll downgrade vm_page_prot >>> @@ -1484,14 +1510,13 @@ SYSCALL_DEFINE1(old_mmap, struct mmap_arg_struct __user *, arg) >>> int vma_wants_writenotify(struct vm_area_struct *vma, pgprot_t vm_page_prot) >>> { >>> vm_flags_t vm_flags = vma->vm_flags; >>> - const struct vm_operations_struct *vm_ops = vma->vm_ops; >>> /* If it was private or non-writable, the write bit is already clear */ >>> if ((vm_flags & (VM_WRITE|VM_SHARED)) != ((VM_WRITE|VM_SHARED))) >>> return 0; >>> /* The backer wishes to know when pages are first written to? */ >>> - if (vm_ops && (vm_ops->page_mkwrite || vm_ops->pfn_mkwrite)) >>> + if (vm_ops_needs_writenotify(vma->vm_ops)) >>> return 1; >>> /* The open routine did something to the protections that pgprot_modify >>> @@ -1511,13 +1536,7 @@ int vma_wants_writenotify(struct vm_area_struct *vma, pgprot_t vm_page_prot) >>> if (userfaultfd_wp(vma)) >>> return 1; >>> - /* Specialty mapping? */ >>> - if (vm_flags & VM_PFNMAP) >>> - return 0; >>> - >>> - /* Can the mapping track the dirty pages? */ >>> - return vma->vm_file && vma->vm_file->f_mapping && >>> - mapping_can_writeback(vma->vm_file->f_mapping); >>> + return vma_needs_dirty_tracking(vma); >>> } >>> /* >> >> >> --Mika >> >
On Wed, Apr 26, 2023 at 10:10:14AM +0300, Mika Penttilä wrote: > > > On 26.4.2023 10.00, Lorenzo Stoakes wrote: > > On Wed, Apr 26, 2023 at 06:18:38AM +0300, Mika Penttilä wrote: > > > Hi, > > > > > > > > > On 26.4.2023 2.15, Lorenzo Stoakes wrote: > > > > GUP does not correctly implement write-notify semantics, nor does it > > > > guarantee that the underlying pages are correctly dirtied, which could lead > > > > to a kernel oops or data corruption when writing to file-backed mappings. > > > > > > > > This is only relevant when the mappings are file-backed and the underlying > > > > file system requires folio dirty tracking. File systems which do not, such > > > > as shmem or hugetlb, are not at risk and therefore can be written to > > > > without issue. > > > > > > > > Unfortunately this limitation of GUP has been present for some time and > > > > requires future rework of the GUP API in order to provide correct write > > > > access to such mappings. > > > > > > > > In the meantime, we add a check for the most broken GUP case - > > > > FOLL_LONGTERM - which really under no circumstances can safely access > > > > dirty-tracked file mappings. > > > > > > > > As part of this change we separate out vma_needs_dirty_tracking() as a > > > > helper function to determine this, which is distinct from > > > > vma_wants_writenotify() which is specific to determining which PTE flags to > > > > set. > > > > > > > > Suggested-by: Jason Gunthorpe <jgg@nvidia.com> > > > > Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com> > > > > --- > > > > v4: > > > > - Split out vma_needs_dirty_tracking() from vma_wants_writenotify() to reduce > > > > duplication and update to use this in the GUP check. Note that both separately > > > > check vm_ops_needs_writenotify() as the latter needs to test this before the > > > > vm_pgprot_modify() test, resulting in vma_wants_writenotify() checking this > > > > twice, however it is such a small check this should not be egregious. > > > > > > > > v3: > > > > - Rebased on latest mm-unstable as of 24th April 2023. > > > > - Explicitly check whether file system requires folio dirtying. Note that > > > > vma_wants_writenotify() could not be used directly as it is very much focused > > > > on determining if the PTE r/w should be set (e.g. assuming private mapping > > > > does not require it as already set, soft dirty considerations). > > > > - Tested code against shmem and hugetlb mappings - confirmed that these are not > > > > disallowed by the check. > > > > - Eliminate FOLL_ALLOW_BROKEN_FILE_MAPPING flag and instead perform check only > > > > for FOLL_LONGTERM pins. > > > > - As a result, limit check to internal GUP code. > > > > https://lore.kernel.org/all/23c19e27ef0745f6d3125976e047ee0da62569d4.1682406295.git.lstoakes@gmail.com/ > > > > > > > > v2: > > > > - Add accidentally excluded ptrace_access_vm() use of > > > > FOLL_ALLOW_BROKEN_FILE_MAPPING. > > > > - Tweak commit message. > > > > https://lore.kernel.org/all/c8ee7e02d3d4f50bb3e40855c53bda39eec85b7d.1682321768.git.lstoakes@gmail.com/ > > > > > > > > v1: > > > > https://lore.kernel.org/all/f86dc089b460c80805e321747b0898fd1efe93d7.1682168199.git.lstoakes@gmail.com/ > > > > > > > > include/linux/mm.h | 1 + > > > > mm/gup.c | 26 +++++++++++++++++++++++++- > > > > mm/mmap.c | 37 ++++++++++++++++++++++++++++--------- > > > > 3 files changed, 54 insertions(+), 10 deletions(-) > > > > > > > > diff --git a/include/linux/mm.h b/include/linux/mm.h > > > > index 37554b08bb28..f7da02fc89c6 100644 > > > > --- a/include/linux/mm.h > > > > +++ b/include/linux/mm.h > > > > @@ -2433,6 +2433,7 @@ extern unsigned long move_page_tables(struct vm_area_struct *vma, > > > > #define MM_CP_UFFD_WP_ALL (MM_CP_UFFD_WP | \ > > > > MM_CP_UFFD_WP_RESOLVE) > > > > +bool vma_needs_dirty_tracking(struct vm_area_struct *vma); > > > > int vma_wants_writenotify(struct vm_area_struct *vma, pgprot_t vm_page_prot); > > > > static inline bool vma_wants_manual_pte_write_upgrade(struct vm_area_struct *vma) > > > > { > > > > diff --git a/mm/gup.c b/mm/gup.c > > > > index 1f72a717232b..53652453037c 100644 > > > > --- a/mm/gup.c > > > > +++ b/mm/gup.c > > > > @@ -959,16 +959,37 @@ static int faultin_page(struct vm_area_struct *vma, > > > > return 0; > > > > } > > > > +/* > > > > + * Writing to file-backed mappings which require folio dirty tracking using GUP > > > > + * is a fundamentally broken operation as kernel write access to GUP mappings > > > > + * may not adhere to the semantics expected by a file system. > > > > + */ > > > > +static inline bool can_write_file_mapping(struct vm_area_struct *vma, > > > > + unsigned long gup_flags) > > > > +{ > > > > + /* If we aren't pinning then no problematic write can occur. */ > > > > + if (!(gup_flags & (FOLL_GET | FOLL_PIN))) > > > > + return true; > > > > + > > > > + /* We limit this check to the most egregious case - a long term pin. */ > > > > + if (!(gup_flags & FOLL_LONGTERM)) > > > > + return true; > > > > + > > > > + /* If the VMA requires dirty tracking then GUP will be problematic. */ > > > > + return vma_needs_dirty_tracking(vma); > > > > +} > > > > + > > > > static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags) > > > > { > > > > vm_flags_t vm_flags = vma->vm_flags; > > > > int write = (gup_flags & FOLL_WRITE); > > > > int foreign = (gup_flags & FOLL_REMOTE); > > > > + bool vma_anon = vma_is_anonymous(vma); > > > > if (vm_flags & (VM_IO | VM_PFNMAP)) > > > > return -EFAULT; > > > > - if (gup_flags & FOLL_ANON && !vma_is_anonymous(vma)) > > > > + if ((gup_flags & FOLL_ANON) && !vma_anon) > > > > return -EFAULT; > > > > if ((gup_flags & FOLL_LONGTERM) && vma_is_fsdax(vma)) > > > > @@ -978,6 +999,9 @@ static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags) > > > > return -EFAULT; > > > > if (write) { > > > > + if (!vma_anon && !can_write_file_mapping(vma, gup_flags)) > > > > + return -EFAULT; > > > > + > > > > if (!(vm_flags & VM_WRITE)) { > > > > if (!(gup_flags & FOLL_FORCE)) > > > > return -EFAULT; > > > > diff --git a/mm/mmap.c b/mm/mmap.c > > > > index 536bbb8fa0ae..aac638dd22cf 100644 > > > > --- a/mm/mmap.c > > > > +++ b/mm/mmap.c > > > > @@ -1475,6 +1475,32 @@ SYSCALL_DEFINE1(old_mmap, struct mmap_arg_struct __user *, arg) > > > > } > > > > #endif /* __ARCH_WANT_SYS_OLD_MMAP */ > > > > +/* Do VMA operations imply write notify is required? */ > > > > +static inline bool vm_ops_needs_writenotify( > > > > + const struct vm_operations_struct *vm_ops) > > > > +{ > > > > + return vm_ops && (vm_ops->page_mkwrite || vm_ops->pfn_mkwrite); > > > > +} > > > > + > > > > +/* > > > > + * Does this VMA require the underlying folios to have their dirty state > > > > + * tracked? > > > > + */ > > > > +bool vma_needs_dirty_tracking(struct vm_area_struct *vma) > > > > +{ > > > > + /* Does the filesystem need to be notified? */ > > > > + if (vm_ops_needs_writenotify(vma->vm_ops)) > > > > + return true; > > > > + > > > > + /* Specialty mapping? */ > > > > + if (vma->vm_flags & VM_PFNMAP) > > > > + return false; > > > > + > > > > + /* Can the mapping track the dirty pages? */ > > > > + return vma->vm_file && vma->vm_file->f_mapping && > > > > + mapping_can_writeback(vma->vm_file->f_mapping); > > > > +} > > > > + > > > > > > What would be the exact reproducer of the problem? AFAIK writenotify is > > > handled (by handle_mm_fault()) for non cow mappings (shared), where it only > > > matters. > > > > The issue is reproduced simply by page_to_virt(pinned_page)[0] = 'x' :) > > > > The problem is that no faulting actually occurs, so no writenotify, and no > > > Could you elaborate? GUP calls handle_mm_fault() that invokes the write > notify the pte is made first writable. Of course, virt(pinned_page)[0] = 'x' > is not supposed to fault while using the kernel mapping. > The issue is how dirtying works. Typically for a dirty-tracking mapping the kernel makes the mapping read-only, then when a write fault occurs, writenotify is called and the folio is marked dirty. This way the file system knows which files to writeback, then after writeback it 'cleans' them, restoring the read-only mapping and relying on the NEXT write marking write notifying and marking the folio dirty again. If we GUP, _especially_ if it's long term, we run the risk of a write to the folio _after_ it has been cleaned and if the caller tries to do the 'right thing' and mark the folio dirty, it being marked dirty at an unexpected time which might race with other things and thus oops. The issue is that this dirtying mechanism implicitly relies on the memory _only_ being accessed via user mappings, but we're providing another way to access that memory bypassing all that. It's been a fundamental flaw in GUP for some time. This change tries to make things a little better by precluding the riskiest version of this which is the caller indicating that the pin is longterm (via FOLL_LONGTERM). For an example of a user trying to sensibly avoid this, see io_pin_pages() in io_uring/rsrc.c. This is where io_uring tries to explicitly avoid this themselves, something that GUP should clearly be doing. After this change, that code can be removed and we will live in a world where linux has a saner GUP :) Of course we need to make more fundamental changes moving forward, the idea is this improves the situation and eliminates the need for the open-coded solution for io_uring which unblocks my other patch series which is also trying to make GUP more sane. > > > > PG_dirty tracking does either. Unexpected page dirtying can occur even > > after they are cleaned in folio_clear_dirty_for_io(), because the caller > > might manually mark the page dirty at an unexpected time as with the > > unpin_*dirty*() helpers. > > > > I think the long-term solution is to provide a different interface where > > pages are passed back briefly with locks held and with a manual invocation > > of writeprotect, or perhaps some kthread_use_mm() thing so we actually > > trigger the faulting logic, but in the meantime this change helps restore > > some sanity. > > > > > > > > GUP will only allow FOLL_FORCE without faulting for PageAnonExclusive pages. > > > So if you want something beyond normal cow semantics you have custom vm_ops > > > (and mmap() and fault()) > > > > This has nothing to do with FOLL_FORCE. > > > > > > > > Also for longterm pinning gups vs fork vs swap there has been fixes by david > > > recently. > > > > I don't think these are relevant in any way to this issue. > > > > > > > > > > > > > > > /* > > > > * Some shared mappings will want the pages marked read-only > > > > * to track write events. If so, we'll downgrade vm_page_prot > > > > @@ -1484,14 +1510,13 @@ SYSCALL_DEFINE1(old_mmap, struct mmap_arg_struct __user *, arg) > > > > int vma_wants_writenotify(struct vm_area_struct *vma, pgprot_t vm_page_prot) > > > > { > > > > vm_flags_t vm_flags = vma->vm_flags; > > > > - const struct vm_operations_struct *vm_ops = vma->vm_ops; > > > > /* If it was private or non-writable, the write bit is already clear */ > > > > if ((vm_flags & (VM_WRITE|VM_SHARED)) != ((VM_WRITE|VM_SHARED))) > > > > return 0; > > > > /* The backer wishes to know when pages are first written to? */ > > > > - if (vm_ops && (vm_ops->page_mkwrite || vm_ops->pfn_mkwrite)) > > > > + if (vm_ops_needs_writenotify(vma->vm_ops)) > > > > return 1; > > > > /* The open routine did something to the protections that pgprot_modify > > > > @@ -1511,13 +1536,7 @@ int vma_wants_writenotify(struct vm_area_struct *vma, pgprot_t vm_page_prot) > > > > if (userfaultfd_wp(vma)) > > > > return 1; > > > > - /* Specialty mapping? */ > > > > - if (vm_flags & VM_PFNMAP) > > > > - return 0; > > > > - > > > > - /* Can the mapping track the dirty pages? */ > > > > - return vma->vm_file && vma->vm_file->f_mapping && > > > > - mapping_can_writeback(vma->vm_file->f_mapping); > > > > + return vma_needs_dirty_tracking(vma); > > > > } > > > > /* > > > > > > > > > --Mika > > > > > > >
On 26.4.2023 10.20, Lorenzo Stoakes wrote: > On Wed, Apr 26, 2023 at 10:10:14AM +0300, Mika Penttilä wrote: >> >> >> On 26.4.2023 10.00, Lorenzo Stoakes wrote: >>> On Wed, Apr 26, 2023 at 06:18:38AM +0300, Mika Penttilä wrote: >>>> Hi, >>>> >>>> >>>> On 26.4.2023 2.15, Lorenzo Stoakes wrote: >>>>> GUP does not correctly implement write-notify semantics, nor does it >>>>> guarantee that the underlying pages are correctly dirtied, which could lead >>>>> to a kernel oops or data corruption when writing to file-backed mappings. >>>>> >>>>> This is only relevant when the mappings are file-backed and the underlying >>>>> file system requires folio dirty tracking. File systems which do not, such >>>>> as shmem or hugetlb, are not at risk and therefore can be written to >>>>> without issue. >>>>> >>>>> Unfortunately this limitation of GUP has been present for some time and >>>>> requires future rework of the GUP API in order to provide correct write >>>>> access to such mappings. >>>>> >>>>> In the meantime, we add a check for the most broken GUP case - >>>>> FOLL_LONGTERM - which really under no circumstances can safely access >>>>> dirty-tracked file mappings. >>>>> >>>>> As part of this change we separate out vma_needs_dirty_tracking() as a >>>>> helper function to determine this, which is distinct from >>>>> vma_wants_writenotify() which is specific to determining which PTE flags to >>>>> set. >>>>> >>>>> Suggested-by: Jason Gunthorpe <jgg@nvidia.com> >>>>> Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com> >>>>> --- >>>>> v4: >>>>> - Split out vma_needs_dirty_tracking() from vma_wants_writenotify() to reduce >>>>> duplication and update to use this in the GUP check. Note that both separately >>>>> check vm_ops_needs_writenotify() as the latter needs to test this before the >>>>> vm_pgprot_modify() test, resulting in vma_wants_writenotify() checking this >>>>> twice, however it is such a small check this should not be egregious. >>>>> >>>>> v3: >>>>> - Rebased on latest mm-unstable as of 24th April 2023. >>>>> - Explicitly check whether file system requires folio dirtying. Note that >>>>> vma_wants_writenotify() could not be used directly as it is very much focused >>>>> on determining if the PTE r/w should be set (e.g. assuming private mapping >>>>> does not require it as already set, soft dirty considerations). >>>>> - Tested code against shmem and hugetlb mappings - confirmed that these are not >>>>> disallowed by the check. >>>>> - Eliminate FOLL_ALLOW_BROKEN_FILE_MAPPING flag and instead perform check only >>>>> for FOLL_LONGTERM pins. >>>>> - As a result, limit check to internal GUP code. >>>>> https://lore.kernel.org/all/23c19e27ef0745f6d3125976e047ee0da62569d4.1682406295.git.lstoakes@gmail.com/ >>>>> >>>>> v2: >>>>> - Add accidentally excluded ptrace_access_vm() use of >>>>> FOLL_ALLOW_BROKEN_FILE_MAPPING. >>>>> - Tweak commit message. >>>>> https://lore.kernel.org/all/c8ee7e02d3d4f50bb3e40855c53bda39eec85b7d.1682321768.git.lstoakes@gmail.com/ >>>>> >>>>> v1: >>>>> https://lore.kernel.org/all/f86dc089b460c80805e321747b0898fd1efe93d7.1682168199.git.lstoakes@gmail.com/ >>>>> >>>>> include/linux/mm.h | 1 + >>>>> mm/gup.c | 26 +++++++++++++++++++++++++- >>>>> mm/mmap.c | 37 ++++++++++++++++++++++++++++--------- >>>>> 3 files changed, 54 insertions(+), 10 deletions(-) >>>>> >>>>> diff --git a/include/linux/mm.h b/include/linux/mm.h >>>>> index 37554b08bb28..f7da02fc89c6 100644 >>>>> --- a/include/linux/mm.h >>>>> +++ b/include/linux/mm.h >>>>> @@ -2433,6 +2433,7 @@ extern unsigned long move_page_tables(struct vm_area_struct *vma, >>>>> #define MM_CP_UFFD_WP_ALL (MM_CP_UFFD_WP | \ >>>>> MM_CP_UFFD_WP_RESOLVE) >>>>> +bool vma_needs_dirty_tracking(struct vm_area_struct *vma); >>>>> int vma_wants_writenotify(struct vm_area_struct *vma, pgprot_t vm_page_prot); >>>>> static inline bool vma_wants_manual_pte_write_upgrade(struct vm_area_struct *vma) >>>>> { >>>>> diff --git a/mm/gup.c b/mm/gup.c >>>>> index 1f72a717232b..53652453037c 100644 >>>>> --- a/mm/gup.c >>>>> +++ b/mm/gup.c >>>>> @@ -959,16 +959,37 @@ static int faultin_page(struct vm_area_struct *vma, >>>>> return 0; >>>>> } >>>>> +/* >>>>> + * Writing to file-backed mappings which require folio dirty tracking using GUP >>>>> + * is a fundamentally broken operation as kernel write access to GUP mappings >>>>> + * may not adhere to the semantics expected by a file system. >>>>> + */ >>>>> +static inline bool can_write_file_mapping(struct vm_area_struct *vma, >>>>> + unsigned long gup_flags) >>>>> +{ >>>>> + /* If we aren't pinning then no problematic write can occur. */ >>>>> + if (!(gup_flags & (FOLL_GET | FOLL_PIN))) >>>>> + return true; >>>>> + >>>>> + /* We limit this check to the most egregious case - a long term pin. */ >>>>> + if (!(gup_flags & FOLL_LONGTERM)) >>>>> + return true; >>>>> + >>>>> + /* If the VMA requires dirty tracking then GUP will be problematic. */ >>>>> + return vma_needs_dirty_tracking(vma); >>>>> +} >>>>> + >>>>> static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags) >>>>> { >>>>> vm_flags_t vm_flags = vma->vm_flags; >>>>> int write = (gup_flags & FOLL_WRITE); >>>>> int foreign = (gup_flags & FOLL_REMOTE); >>>>> + bool vma_anon = vma_is_anonymous(vma); >>>>> if (vm_flags & (VM_IO | VM_PFNMAP)) >>>>> return -EFAULT; >>>>> - if (gup_flags & FOLL_ANON && !vma_is_anonymous(vma)) >>>>> + if ((gup_flags & FOLL_ANON) && !vma_anon) >>>>> return -EFAULT; >>>>> if ((gup_flags & FOLL_LONGTERM) && vma_is_fsdax(vma)) >>>>> @@ -978,6 +999,9 @@ static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags) >>>>> return -EFAULT; >>>>> if (write) { >>>>> + if (!vma_anon && !can_write_file_mapping(vma, gup_flags)) >>>>> + return -EFAULT; >>>>> + >>>>> if (!(vm_flags & VM_WRITE)) { >>>>> if (!(gup_flags & FOLL_FORCE)) >>>>> return -EFAULT; >>>>> diff --git a/mm/mmap.c b/mm/mmap.c >>>>> index 536bbb8fa0ae..aac638dd22cf 100644 >>>>> --- a/mm/mmap.c >>>>> +++ b/mm/mmap.c >>>>> @@ -1475,6 +1475,32 @@ SYSCALL_DEFINE1(old_mmap, struct mmap_arg_struct __user *, arg) >>>>> } >>>>> #endif /* __ARCH_WANT_SYS_OLD_MMAP */ >>>>> +/* Do VMA operations imply write notify is required? */ >>>>> +static inline bool vm_ops_needs_writenotify( >>>>> + const struct vm_operations_struct *vm_ops) >>>>> +{ >>>>> + return vm_ops && (vm_ops->page_mkwrite || vm_ops->pfn_mkwrite); >>>>> +} >>>>> + >>>>> +/* >>>>> + * Does this VMA require the underlying folios to have their dirty state >>>>> + * tracked? >>>>> + */ >>>>> +bool vma_needs_dirty_tracking(struct vm_area_struct *vma) >>>>> +{ >>>>> + /* Does the filesystem need to be notified? */ >>>>> + if (vm_ops_needs_writenotify(vma->vm_ops)) >>>>> + return true; >>>>> + >>>>> + /* Specialty mapping? */ >>>>> + if (vma->vm_flags & VM_PFNMAP) >>>>> + return false; >>>>> + >>>>> + /* Can the mapping track the dirty pages? */ >>>>> + return vma->vm_file && vma->vm_file->f_mapping && >>>>> + mapping_can_writeback(vma->vm_file->f_mapping); >>>>> +} >>>>> + >>>> >>>> What would be the exact reproducer of the problem? AFAIK writenotify is >>>> handled (by handle_mm_fault()) for non cow mappings (shared), where it only >>>> matters. >>> >>> The issue is reproduced simply by page_to_virt(pinned_page)[0] = 'x' :) >>> >>> The problem is that no faulting actually occurs, so no writenotify, and no >> >> >> Could you elaborate? GUP calls handle_mm_fault() that invokes the write >> notify the pte is made first writable. Of course, virt(pinned_page)[0] = 'x' >> is not supposed to fault while using the kernel mapping. >> > > The issue is how dirtying works. Typically for a dirty-tracking mapping the > kernel makes the mapping read-only, then when a write fault occurs, > writenotify is called and the folio is marked dirty. This way the file > system knows which files to writeback, then after writeback it 'cleans' > them, restoring the read-only mapping and relying on the NEXT write marking > write notifying and marking the folio dirty again. > I know how the dirty tracking works :). And gup itself actually triggers the _first_ fault on a read only pte. So the problem is accessing the page after that, somewehere in future. I think this is something you should write on the description. Because, technically, GUP itself works and does invoke the write notify. So the misleading part is you say in the description it doesn't. While you mean a later write, from a driver or such, doesn't. > If we GUP, _especially_ if it's long term, we run the risk of a write to > the folio _after_ it has been cleaned and if the caller tries to do the > 'right thing' and mark the folio dirty, it being marked dirty at an > unexpected time which might race with other things and thus oops. > > The issue is that this dirtying mechanism implicitly relies on the memory > _only_ being accessed via user mappings, but we're providing another way to > access that memory bypassing all that. > > It's been a fundamental flaw in GUP for some time. This change tries to > make things a little better by precluding the riskiest version of this > which is the caller indicating that the pin is longterm (via > FOLL_LONGTERM). > > For an example of a user trying to sensibly avoid this, see io_pin_pages() > in io_uring/rsrc.c. This is where io_uring tries to explicitly avoid this > themselves, something that GUP should clearly be doing. > > After this change, that code can be removed and we will live in a world > where linux has a saner GUP :) > > Of course we need to make more fundamental changes moving forward, the idea > is this improves the situation and eliminates the need for the open-coded > solution for io_uring which unblocks my other patch series which is also > trying to make GUP more sane. > > >> >> >>> PG_dirty tracking does either. Unexpected page dirtying can occur even >>> after they are cleaned in folio_clear_dirty_for_io(), because the caller >>> might manually mark the page dirty at an unexpected time as with the >>> unpin_*dirty*() helpers. >>> >>> I think the long-term solution is to provide a different interface where >>> pages are passed back briefly with locks held and with a manual invocation >>> of writeprotect, or perhaps some kthread_use_mm() thing so we actually >>> trigger the faulting logic, but in the meantime this change helps restore >>> some sanity. >>> >>>> >>>> GUP will only allow FOLL_FORCE without faulting for PageAnonExclusive pages. >>>> So if you want something beyond normal cow semantics you have custom vm_ops >>>> (and mmap() and fault()) >>> >>> This has nothing to do with FOLL_FORCE. >>> >>>> >>>> Also for longterm pinning gups vs fork vs swap there has been fixes by david >>>> recently. >>> >>> I don't think these are relevant in any way to this issue. >>> >>>> >>>> >>>> >>>>> /* >>>>> * Some shared mappings will want the pages marked read-only >>>>> * to track write events. If so, we'll downgrade vm_page_prot >>>>> @@ -1484,14 +1510,13 @@ SYSCALL_DEFINE1(old_mmap, struct mmap_arg_struct __user *, arg) >>>>> int vma_wants_writenotify(struct vm_area_struct *vma, pgprot_t vm_page_prot) >>>>> { >>>>> vm_flags_t vm_flags = vma->vm_flags; >>>>> - const struct vm_operations_struct *vm_ops = vma->vm_ops; >>>>> /* If it was private or non-writable, the write bit is already clear */ >>>>> if ((vm_flags & (VM_WRITE|VM_SHARED)) != ((VM_WRITE|VM_SHARED))) >>>>> return 0; >>>>> /* The backer wishes to know when pages are first written to? */ >>>>> - if (vm_ops && (vm_ops->page_mkwrite || vm_ops->pfn_mkwrite)) >>>>> + if (vm_ops_needs_writenotify(vma->vm_ops)) >>>>> return 1; >>>>> /* The open routine did something to the protections that pgprot_modify >>>>> @@ -1511,13 +1536,7 @@ int vma_wants_writenotify(struct vm_area_struct *vma, pgprot_t vm_page_prot) >>>>> if (userfaultfd_wp(vma)) >>>>> return 1; >>>>> - /* Specialty mapping? */ >>>>> - if (vm_flags & VM_PFNMAP) >>>>> - return 0; >>>>> - >>>>> - /* Can the mapping track the dirty pages? */ >>>>> - return vma->vm_file && vma->vm_file->f_mapping && >>>>> - mapping_can_writeback(vma->vm_file->f_mapping); >>>>> + return vma_needs_dirty_tracking(vma); >>>>> } >>>>> /* >>>> >>>> >>>> --Mika >>>> >>> >> >> > --Mika
On Wed, Apr 26, 2023 at 10:30:03AM +0300, Mika Penttilä wrote: [snip] > > The issue is how dirtying works. Typically for a dirty-tracking mapping the > > kernel makes the mapping read-only, then when a write fault occurs, > > writenotify is called and the folio is marked dirty. This way the file > > system knows which files to writeback, then after writeback it 'cleans' > > them, restoring the read-only mapping and relying on the NEXT write marking > > write notifying and marking the folio dirty again. > > > > I know how the dirty tracking works :). And gup itself actually triggers the > _first_ fault on a read only pte. I'm sure you don't mean to, but this comes off as sarcastic, 'I know how X works :)' is not a helpful comment. However, equally apologies if I seemed patronising, not intentional, I am just trying to be as clear as possible, which always risks sounding that way :) Regardless, this is a very good point! I think I was a little too implicit in the whole 'at any time the kernel chooses to write to this writenotify won't happen', and you are absolutely right in that we are not clear enough about that. > > So the problem is accessing the page after that, somewehere in future. I > think this is something you should write on the description. Because, > technically, GUP itself works and does invoke the write notify. So the > misleading part is you say in the description it doesn't. While you mean a > later write, from a driver or such, doesn't. > Ack, agreed this would be a useful improvement. Will fix on next spin! [snip]
On 26.4.2023 11.41, Lorenzo Stoakes wrote: > On Wed, Apr 26, 2023 at 10:30:03AM +0300, Mika Penttilä wrote: > > [snip] > >>> The issue is how dirtying works. Typically for a dirty-tracking mapping the >>> kernel makes the mapping read-only, then when a write fault occurs, >>> writenotify is called and the folio is marked dirty. This way the file >>> system knows which files to writeback, then after writeback it 'cleans' >>> them, restoring the read-only mapping and relying on the NEXT write marking >>> write notifying and marking the folio dirty again. >>> >> >> I know how the dirty tracking works :). And gup itself actually triggers the >> _first_ fault on a read only pte. > > I'm sure you don't mean to, but this comes off as sarcastic, 'I know how X > works :)' is not a helpful comment. However, equally apologies if I seemed > patronising, not intentional, I am just trying to be as clear as possible, > which always risks sounding that way :) Absolutely didn't mean that way, and thanks for being clear here! > > Regardless, this is a very good point! I think I was a little too implicit > in the whole 'at any time the kernel chooses to write to this writenotify > won't happen', and you are absolutely right in that we are not clear enough > about that. > >> >> So the problem is accessing the page after that, somewehere in future. I >> think this is something you should write on the description. Because, >> technically, GUP itself works and does invoke the write notify. So the >> misleading part is you say in the description it doesn't. While you mean a >> later write, from a driver or such, doesn't. >> > > Ack, agreed this would be a useful improvement. Will fix on next spin! Yes thanks, think so, at least found myself going thru and wondering what's wrong with the gup code itself, and not the later usage scenario... > > [snip] > --Mika
On 4/26/23 00:20, Lorenzo Stoakes wrote: ... >> Could you elaborate? GUP calls handle_mm_fault() that invokes the write >> notify the pte is made first writable. Of course, virt(pinned_page)[0] = 'x' >> is not supposed to fault while using the kernel mapping. >> > > The issue is how dirtying works. Typically for a dirty-tracking mapping the > kernel makes the mapping read-only, then when a write fault occurs, > writenotify is called and the folio is marked dirty. This way the file > system knows which files to writeback, then after writeback it 'cleans' > them, restoring the read-only mapping and relying on the NEXT write marking > write notifying and marking the folio dirty again. > > If we GUP, _especially_ if it's long term, we run the risk of a write to > the folio _after_ it has been cleaned and if the caller tries to do the > 'right thing' and mark the folio dirty, it being marked dirty at an > unexpected time which might race with other things and thus oops. > > The issue is that this dirtying mechanism implicitly relies on the memory > _only_ being accessed via user mappings, but we're providing another way to > access that memory bypassing all that. > > It's been a fundamental flaw in GUP for some time. This change tries to > make things a little better by precluding the riskiest version of this > which is the caller indicating that the pin is longterm (via > FOLL_LONGTERM). > > For an example of a user trying to sensibly avoid this, see io_pin_pages() > in io_uring/rsrc.c. This is where io_uring tries to explicitly avoid this > themselves, something that GUP should clearly be doing. > > After this change, that code can be removed and we will live in a world > where linux has a saner GUP :) Hi Lorenzo, This is the "missing writeup" that really helps people visualize the problem, very nice. I think if you include the above, plus maybe a link to Jan Kara's original report [1], in the commit description, that will help a lot. > > Of course we need to make more fundamental changes moving forward, the idea > is this improves the situation and eliminates the need for the open-coded > solution for io_uring which unblocks my other patch series which is also > trying to make GUP more sane. It is true that solving the problem has taken "a few"...years, and is still not done, yes. :) I'll respond to the patch with a review, as well. [1] https://lore.kernel.org/linux-mm/20180103100430.GE4911@quack2.suse.cz/ thanks,
On 4/25/23 16:15, Lorenzo Stoakes wrote: > GUP does not correctly implement write-notify semantics, nor does it > guarantee that the underlying pages are correctly dirtied, which could lead > to a kernel oops or data corruption when writing to file-backed mappings. > > This is only relevant when the mappings are file-backed and the underlying > file system requires folio dirty tracking. File systems which do not, such > as shmem or hugetlb, are not at risk and therefore can be written to > without issue. > > Unfortunately this limitation of GUP has been present for some time and > requires future rework of the GUP API in order to provide correct write > access to such mappings. > > In the meantime, we add a check for the most broken GUP case - > FOLL_LONGTERM - which really under no circumstances can safely access > dirty-tracked file mappings. > Hi Lorenzo, As I mentioned in a sub-thread [1], it would be a nice touch to include your more detailed write-up, and a link to Jan Kara's original report, here. > As part of this change we separate out vma_needs_dirty_tracking() as a > helper function to determine this, which is distinct from > vma_wants_writenotify() which is specific to determining which PTE flags to > set. This, I think should go in a separate cleanup patch, because it is (nearly) the same behavior. More notes below on this. More notes below: > > Suggested-by: Jason Gunthorpe <jgg@nvidia.com> > Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com> > --- > v4: > - Split out vma_needs_dirty_tracking() from vma_wants_writenotify() to reduce > duplication and update to use this in the GUP check. Note that both separately > check vm_ops_needs_writenotify() as the latter needs to test this before the > vm_pgprot_modify() test, resulting in vma_wants_writenotify() checking this > twice, however it is such a small check this should not be egregious. > > v3: > - Rebased on latest mm-unstable as of 24th April 2023. > - Explicitly check whether file system requires folio dirtying. Note that > vma_wants_writenotify() could not be used directly as it is very much focused > on determining if the PTE r/w should be set (e.g. assuming private mapping > does not require it as already set, soft dirty considerations). > - Tested code against shmem and hugetlb mappings - confirmed that these are not > disallowed by the check. > - Eliminate FOLL_ALLOW_BROKEN_FILE_MAPPING flag and instead perform check only > for FOLL_LONGTERM pins. > - As a result, limit check to internal GUP code. > https://lore.kernel.org/all/23c19e27ef0745f6d3125976e047ee0da62569d4.1682406295.git.lstoakes@gmail.com/ > > v2: > - Add accidentally excluded ptrace_access_vm() use of > FOLL_ALLOW_BROKEN_FILE_MAPPING. > - Tweak commit message. > https://lore.kernel.org/all/c8ee7e02d3d4f50bb3e40855c53bda39eec85b7d.1682321768.git.lstoakes@gmail.com/ > > v1: > https://lore.kernel.org/all/f86dc089b460c80805e321747b0898fd1efe93d7.1682168199.git.lstoakes@gmail.com/ > > include/linux/mm.h | 1 + > mm/gup.c | 26 +++++++++++++++++++++++++- > mm/mmap.c | 37 ++++++++++++++++++++++++++++--------- > 3 files changed, 54 insertions(+), 10 deletions(-) > > diff --git a/include/linux/mm.h b/include/linux/mm.h > index 37554b08bb28..f7da02fc89c6 100644 > --- a/include/linux/mm.h > +++ b/include/linux/mm.h > @@ -2433,6 +2433,7 @@ extern unsigned long move_page_tables(struct vm_area_struct *vma, > #define MM_CP_UFFD_WP_ALL (MM_CP_UFFD_WP | \ > MM_CP_UFFD_WP_RESOLVE) > > +bool vma_needs_dirty_tracking(struct vm_area_struct *vma); > int vma_wants_writenotify(struct vm_area_struct *vma, pgprot_t vm_page_prot); > static inline bool vma_wants_manual_pte_write_upgrade(struct vm_area_struct *vma) > { > diff --git a/mm/gup.c b/mm/gup.c > index 1f72a717232b..53652453037c 100644 > --- a/mm/gup.c > +++ b/mm/gup.c > @@ -959,16 +959,37 @@ static int faultin_page(struct vm_area_struct *vma, > return 0; > } > > +/* > + * Writing to file-backed mappings which require folio dirty tracking using GUP > + * is a fundamentally broken operation as kernel write access to GUP mappings > + * may not adhere to the semantics expected by a file system. > + */ > +static inline bool can_write_file_mapping(struct vm_area_struct *vma, > + unsigned long gup_flags) Perhaps name this: writeable_file_mapping_allowed() ? "can" is more about "is this possible", whereas the goal here is to express, "should this be allowed". Also a silly tiny nit: let's omit the "inline" keyword, down here in this .c file, and let the compiler work that out instead. "static" should suffice. > +{ > + /* If we aren't pinning then no problematic write can occur. */ > + if (!(gup_flags & (FOLL_GET | FOLL_PIN))) > + return true; > + > + /* We limit this check to the most egregious case - a long term pin. */ > + if (!(gup_flags & FOLL_LONGTERM)) > + return true; > + > + /* If the VMA requires dirty tracking then GUP will be problematic. */ > + return vma_needs_dirty_tracking(vma); > +} > + > static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags) > { > vm_flags_t vm_flags = vma->vm_flags; > int write = (gup_flags & FOLL_WRITE); > int foreign = (gup_flags & FOLL_REMOTE); > + bool vma_anon = vma_is_anonymous(vma); This name: bool file_backed = !vma_is_anonymous(vma); would lead to a slightly better reading experience below. Sorry for the small naming and documentation comments here, it's just what I do. :) > > if (vm_flags & (VM_IO | VM_PFNMAP)) > return -EFAULT; > > - if (gup_flags & FOLL_ANON && !vma_is_anonymous(vma)) > + if ((gup_flags & FOLL_ANON) && !vma_anon) > return -EFAULT; > > if ((gup_flags & FOLL_LONGTERM) && vma_is_fsdax(vma)) > @@ -978,6 +999,9 @@ static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags) > return -EFAULT; > > if (write) { > + if (!vma_anon && !can_write_file_mapping(vma, gup_flags)) > + return -EFAULT; > + > if (!(vm_flags & VM_WRITE)) { > if (!(gup_flags & FOLL_FORCE)) > return -EFAULT; > diff --git a/mm/mmap.c b/mm/mmap.c > index 536bbb8fa0ae..aac638dd22cf 100644 > --- a/mm/mmap.c > +++ b/mm/mmap.c > @@ -1475,6 +1475,32 @@ SYSCALL_DEFINE1(old_mmap, struct mmap_arg_struct __user *, arg) > } > #endif /* __ARCH_WANT_SYS_OLD_MMAP */ > > +/* Do VMA operations imply write notify is required? */ > +static inline bool vm_ops_needs_writenotify( This "inline" should also be omitted, imho. > + const struct vm_operations_struct *vm_ops) > +{ > + return vm_ops && (vm_ops->page_mkwrite || vm_ops->pfn_mkwrite); > +} > + > +/* > + * Does this VMA require the underlying folios to have their dirty state > + * tracked? > + */ > +bool vma_needs_dirty_tracking(struct vm_area_struct *vma) > +{ > + /* Does the filesystem need to be notified? */ > + if (vm_ops_needs_writenotify(vma->vm_ops)) > + return true; > + > + /* Specialty mapping? */ > + if (vma->vm_flags & VM_PFNMAP) > + return false; > + > + /* Can the mapping track the dirty pages? */ > + return vma->vm_file && vma->vm_file->f_mapping && > + mapping_can_writeback(vma->vm_file->f_mapping); > +} > + > /* > * Some shared mappings will want the pages marked read-only > * to track write events. If so, we'll downgrade vm_page_prot > @@ -1484,14 +1510,13 @@ SYSCALL_DEFINE1(old_mmap, struct mmap_arg_struct __user *, arg) > int vma_wants_writenotify(struct vm_area_struct *vma, pgprot_t vm_page_prot) > { > vm_flags_t vm_flags = vma->vm_flags; > - const struct vm_operations_struct *vm_ops = vma->vm_ops; > > /* If it was private or non-writable, the write bit is already clear */ > if ((vm_flags & (VM_WRITE|VM_SHARED)) != ((VM_WRITE|VM_SHARED))) > return 0; > > /* The backer wishes to know when pages are first written to? */ > - if (vm_ops && (vm_ops->page_mkwrite || vm_ops->pfn_mkwrite)) > + if (vm_ops_needs_writenotify(vma->vm_ops)) OK, so here we are calling vm_ops_needs_writenotify(), that's the first call. And then... > return 1; > > /* The open routine did something to the protections that pgprot_modify > @@ -1511,13 +1536,7 @@ int vma_wants_writenotify(struct vm_area_struct *vma, pgprot_t vm_page_prot) > if (userfaultfd_wp(vma)) > return 1; > > - /* Specialty mapping? */ > - if (vm_flags & VM_PFNMAP) > - return 0; > - > - /* Can the mapping track the dirty pages? */ > - return vma->vm_file && vma->vm_file->f_mapping && > - mapping_can_writeback(vma->vm_file->f_mapping); > + return vma_needs_dirty_tracking(vma); ...and now we call it again. I think once should be enough, though. Also, with the exception of that double call to vm_ops_needs_writenotify(), these changes to mmap.c are code cleanup that has the same behavior as before. As such, it's better to separate them out from this patch whose goal is very much to change behavior. [1] https://lore.kernel.org/all/1b9e3406-c08e-b97c-d46f-22f36535d9e5@nvidia.com/ thanks,
On Wed, Apr 26, 2023 at 05:52:09PM -0700, John Hubbard wrote: > On 4/25/23 16:15, Lorenzo Stoakes wrote: > > GUP does not correctly implement write-notify semantics, nor does it > > guarantee that the underlying pages are correctly dirtied, which could lead > > to a kernel oops or data corruption when writing to file-backed mappings. > > > > This is only relevant when the mappings are file-backed and the underlying > > file system requires folio dirty tracking. File systems which do not, such > > as shmem or hugetlb, are not at risk and therefore can be written to > > without issue. > > > > Unfortunately this limitation of GUP has been present for some time and > > requires future rework of the GUP API in order to provide correct write > > access to such mappings. > > > > In the meantime, we add a check for the most broken GUP case - > > FOLL_LONGTERM - which really under no circumstances can safely access > > dirty-tracked file mappings. > > > > Hi Lorenzo, > > As I mentioned in a sub-thread [1], it would be a nice touch to include > your more detailed write-up, and a link to Jan Kara's original report, > here. Ack will do. > > > > As part of this change we separate out vma_needs_dirty_tracking() as a > > helper function to determine this, which is distinct from > > vma_wants_writenotify() which is specific to determining which PTE flags to > > set. > > This, I think should go in a separate cleanup patch, because it is > (nearly) the same behavior. More notes below on this. > > More notes below: > > > > > Suggested-by: Jason Gunthorpe <jgg@nvidia.com> > > Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com> > > --- > > v4: > > - Split out vma_needs_dirty_tracking() from vma_wants_writenotify() to reduce > > duplication and update to use this in the GUP check. Note that both separately > > check vm_ops_needs_writenotify() as the latter needs to test this before the > > vm_pgprot_modify() test, resulting in vma_wants_writenotify() checking this > > twice, however it is such a small check this should not be egregious. > > > > v3: > > - Rebased on latest mm-unstable as of 24th April 2023. > > - Explicitly check whether file system requires folio dirtying. Note that > > vma_wants_writenotify() could not be used directly as it is very much focused > > on determining if the PTE r/w should be set (e.g. assuming private mapping > > does not require it as already set, soft dirty considerations). > > - Tested code against shmem and hugetlb mappings - confirmed that these are not > > disallowed by the check. > > - Eliminate FOLL_ALLOW_BROKEN_FILE_MAPPING flag and instead perform check only > > for FOLL_LONGTERM pins. > > - As a result, limit check to internal GUP code. > > https://lore.kernel.org/all/23c19e27ef0745f6d3125976e047ee0da62569d4.1682406295.git.lstoakes@gmail.com/ > > > > v2: > > - Add accidentally excluded ptrace_access_vm() use of > > FOLL_ALLOW_BROKEN_FILE_MAPPING. > > - Tweak commit message. > > https://lore.kernel.org/all/c8ee7e02d3d4f50bb3e40855c53bda39eec85b7d.1682321768.git.lstoakes@gmail.com/ > > > > v1: > > https://lore.kernel.org/all/f86dc089b460c80805e321747b0898fd1efe93d7.1682168199.git.lstoakes@gmail.com/ > > > > include/linux/mm.h | 1 + > > mm/gup.c | 26 +++++++++++++++++++++++++- > > mm/mmap.c | 37 ++++++++++++++++++++++++++++--------- > > 3 files changed, 54 insertions(+), 10 deletions(-) > > > > diff --git a/include/linux/mm.h b/include/linux/mm.h > > index 37554b08bb28..f7da02fc89c6 100644 > > --- a/include/linux/mm.h > > +++ b/include/linux/mm.h > > @@ -2433,6 +2433,7 @@ extern unsigned long move_page_tables(struct vm_area_struct *vma, > > #define MM_CP_UFFD_WP_ALL (MM_CP_UFFD_WP | \ > > MM_CP_UFFD_WP_RESOLVE) > > +bool vma_needs_dirty_tracking(struct vm_area_struct *vma); > > int vma_wants_writenotify(struct vm_area_struct *vma, pgprot_t vm_page_prot); > > static inline bool vma_wants_manual_pte_write_upgrade(struct vm_area_struct *vma) > > { > > diff --git a/mm/gup.c b/mm/gup.c > > index 1f72a717232b..53652453037c 100644 > > --- a/mm/gup.c > > +++ b/mm/gup.c > > @@ -959,16 +959,37 @@ static int faultin_page(struct vm_area_struct *vma, > > return 0; > > } > > +/* > > + * Writing to file-backed mappings which require folio dirty tracking using GUP > > + * is a fundamentally broken operation as kernel write access to GUP mappings > > + * may not adhere to the semantics expected by a file system. > > + */ > > +static inline bool can_write_file_mapping(struct vm_area_struct *vma, > > + unsigned long gup_flags) > > Perhaps name this: > writeable_file_mapping_allowed() > > ? "can" is more about "is this possible", whereas the goal here is > to express, "should this be allowed". Yes I struggled a bit with how best to name this, I think your suggestion is better > > Also a silly tiny nit: let's omit the "inline" keyword, down here in this > .c file, and let the compiler work that out instead. "static" should suffice. Ack, I saw some inconsistency in gup.c about this so wondered which way to go, obviously it's not really all that useful here (compiler will inline if it wants) > > > +{ > > + /* If we aren't pinning then no problematic write can occur. */ > > + if (!(gup_flags & (FOLL_GET | FOLL_PIN))) > > + return true; > > + > > + /* We limit this check to the most egregious case - a long term pin. */ > > + if (!(gup_flags & FOLL_LONGTERM)) > > + return true; > > + > > + /* If the VMA requires dirty tracking then GUP will be problematic. */ > > + return vma_needs_dirty_tracking(vma); > > +} > > + > > static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags) > > { > > vm_flags_t vm_flags = vma->vm_flags; > > int write = (gup_flags & FOLL_WRITE); > > int foreign = (gup_flags & FOLL_REMOTE); > > + bool vma_anon = vma_is_anonymous(vma); > > This name: > > bool file_backed = !vma_is_anonymous(vma); > > would lead to a slightly better reading experience below. Well you see, I'm not so sure about that, because vma_is_anonymous() checks vm_ops == NULL not vm_file == NULL which can be the case for a special mapping like VDSO that is not in fact file-backed :) the horror, the horror. > > Sorry for the small naming and documentation comments here, > it's just what I do. :) > > > > if (vm_flags & (VM_IO | VM_PFNMAP)) > > return -EFAULT; > > - if (gup_flags & FOLL_ANON && !vma_is_anonymous(vma)) > > + if ((gup_flags & FOLL_ANON) && !vma_anon) > > return -EFAULT; > > if ((gup_flags & FOLL_LONGTERM) && vma_is_fsdax(vma)) > > @@ -978,6 +999,9 @@ static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags) > > return -EFAULT; > > if (write) { > > + if (!vma_anon && !can_write_file_mapping(vma, gup_flags)) > > + return -EFAULT; > > + > > if (!(vm_flags & VM_WRITE)) { > > if (!(gup_flags & FOLL_FORCE)) > > return -EFAULT; > > diff --git a/mm/mmap.c b/mm/mmap.c > > index 536bbb8fa0ae..aac638dd22cf 100644 > > --- a/mm/mmap.c > > +++ b/mm/mmap.c > > @@ -1475,6 +1475,32 @@ SYSCALL_DEFINE1(old_mmap, struct mmap_arg_struct __user *, arg) > > } > > #endif /* __ARCH_WANT_SYS_OLD_MMAP */ > > +/* Do VMA operations imply write notify is required? */ > > +static inline bool vm_ops_needs_writenotify( > > This "inline" should also be omitted, imho. > Same comment as before as to why I did that, but ack > > + const struct vm_operations_struct *vm_ops) > > +{ > > + return vm_ops && (vm_ops->page_mkwrite || vm_ops->pfn_mkwrite); > > +} > > + > > +/* > > + * Does this VMA require the underlying folios to have their dirty state > > + * tracked? > > + */ > > +bool vma_needs_dirty_tracking(struct vm_area_struct *vma) > > +{ > > + /* Does the filesystem need to be notified? */ > > + if (vm_ops_needs_writenotify(vma->vm_ops)) > > + return true; > > + > > + /* Specialty mapping? */ > > + if (vma->vm_flags & VM_PFNMAP) > > + return false; > > + > > + /* Can the mapping track the dirty pages? */ > > + return vma->vm_file && vma->vm_file->f_mapping && > > + mapping_can_writeback(vma->vm_file->f_mapping); > > +} > > + > > /* > > * Some shared mappings will want the pages marked read-only > > * to track write events. If so, we'll downgrade vm_page_prot > > @@ -1484,14 +1510,13 @@ SYSCALL_DEFINE1(old_mmap, struct mmap_arg_struct __user *, arg) > > int vma_wants_writenotify(struct vm_area_struct *vma, pgprot_t vm_page_prot) > > { > > vm_flags_t vm_flags = vma->vm_flags; > > - const struct vm_operations_struct *vm_ops = vma->vm_ops; > > /* If it was private or non-writable, the write bit is already clear */ > > if ((vm_flags & (VM_WRITE|VM_SHARED)) != ((VM_WRITE|VM_SHARED))) > > return 0; > > /* The backer wishes to know when pages are first written to? */ > > - if (vm_ops && (vm_ops->page_mkwrite || vm_ops->pfn_mkwrite)) > > + if (vm_ops_needs_writenotify(vma->vm_ops)) > > OK, so here we are calling vm_ops_needs_writenotify(), that's the > first call. And then... > > > return 1; > > /* The open routine did something to the protections that pgprot_modify > > @@ -1511,13 +1536,7 @@ int vma_wants_writenotify(struct vm_area_struct *vma, pgprot_t vm_page_prot) > > if (userfaultfd_wp(vma)) > > return 1; > > - /* Specialty mapping? */ > > - if (vm_flags & VM_PFNMAP) > > - return 0; > > - > > - /* Can the mapping track the dirty pages? */ > > - return vma->vm_file && vma->vm_file->f_mapping && > > - mapping_can_writeback(vma->vm_file->f_mapping); > > + return vma_needs_dirty_tracking(vma); > > ...and now we call it again. I think once should be enough, though. Right, this was intentional (I think I mentioned it in the revision notes?), because there is a conundrum here - the invocation from vma_wants_writenotify() needs to check this _first_ before performing the _other_ checks in vma_needs_dirty_tracking(), but external calls need all the checks. It'd be ugly to pass a boolean to see if we should check this or not, and it's hardly an egregious duplication for the _computer_ (something likely in a cache line != NULL) which aids readability and reduces duplication for the _reader_ of the code for a path that is inherently slow (likely going to fault in pages etc.) I think it'd be confusing to have yet another split into vma_can_track_dirty() or whatever because then suddenly for the check to be meaningful you have to _always_ check 2 things. Other options like passing an output parameter or returning something other than boolean are equally distasteful. > > Also, with the exception of that double call to > vm_ops_needs_writenotify(), these changes to mmap.c are code cleanup > that has the same behavior as before. As such, it's better to separate > them out from this patch whose goal is very much to change behavior. It's not really cleanup, it's separating out some of the logic explicitly to be used in this new context, without which the separation would not be useful, so I feel it's a bit over the top to turn a small single patch into two simply to avoid this. > > > [1] https://lore.kernel.org/all/1b9e3406-c08e-b97c-d46f-22f36535d9e5@nvidia.com/ > > thanks, > -- > John Hubbard > NVIDIA > Thanks for the review, I will respin with the suggestions (other than ones I don't quite agree with as explained above) and a clearer description in line with Mika's suggestions. Hopefully we can move closer to this actually getting some reviewed/acked-by tags soon :)
On 4/27/23 15:31, Lorenzo Stoakes wrote: ... >> bool file_backed = !vma_is_anonymous(vma); >> >> would lead to a slightly better reading experience below. > > Well you see, I'm not so sure about that, because vma_is_anonymous() checks > vm_ops == NULL not vm_file == NULL which can be the case for a special > mapping like VDSO that is not in fact file-backed :) the horror, the > horror. > Yes, you are right. It looks like vma_anon is a better name here, after all. ... >> ...and now we call it again. I think once should be enough, though. > > Right, this was intentional (I think I mentioned it in the revision > notes?), because there is a conundrum here - the invocation from > vma_wants_writenotify() needs to check this _first_ before performing the > _other_ checks in vma_needs_dirty_tracking(), but external calls need all > the checks. It'd be ugly to pass a boolean to see if we should check this > or not, and it's hardly an egregious duplication for the _computer_ > (something likely in a cache line != NULL) which aids readability and > reduces duplication for the _reader_ of the code for a path that is > inherently slow (likely going to fault in pages etc.) > > I think it'd be confusing to have yet another split into > vma_can_track_dirty() or whatever because then suddenly for the check to be > meaningful you have to _always_ check 2 things. > > Other options like passing an output parameter or returning something other > than boolean are equally distasteful. Agreed. (And yes, I overlooked that discussion in the version notes.) > >> >> Also, with the exception of that double call to >> vm_ops_needs_writenotify(), these changes to mmap.c are code cleanup >> that has the same behavior as before. As such, it's better to separate >> them out from this patch whose goal is very much to change behavior. > > It's not really cleanup, it's separating out some of the logic explicitly > to be used in this new context, without which the separation would not be > useful, so I feel it's a bit over the top to turn a small single patch into > two simply to avoid this. > Sure, OK. >> > > Thanks for the review, I will respin with the suggestions (other than ones > I don't quite agree with as explained above) and a clearer description in > line with Mika's suggestions. > > Hopefully we can move closer to this actually getting some > reviewed/acked-by tags soon :) thanks,
diff --git a/include/linux/mm.h b/include/linux/mm.h index 37554b08bb28..f7da02fc89c6 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -2433,6 +2433,7 @@ extern unsigned long move_page_tables(struct vm_area_struct *vma, #define MM_CP_UFFD_WP_ALL (MM_CP_UFFD_WP | \ MM_CP_UFFD_WP_RESOLVE) +bool vma_needs_dirty_tracking(struct vm_area_struct *vma); int vma_wants_writenotify(struct vm_area_struct *vma, pgprot_t vm_page_prot); static inline bool vma_wants_manual_pte_write_upgrade(struct vm_area_struct *vma) { diff --git a/mm/gup.c b/mm/gup.c index 1f72a717232b..53652453037c 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -959,16 +959,37 @@ static int faultin_page(struct vm_area_struct *vma, return 0; } +/* + * Writing to file-backed mappings which require folio dirty tracking using GUP + * is a fundamentally broken operation as kernel write access to GUP mappings + * may not adhere to the semantics expected by a file system. + */ +static inline bool can_write_file_mapping(struct vm_area_struct *vma, + unsigned long gup_flags) +{ + /* If we aren't pinning then no problematic write can occur. */ + if (!(gup_flags & (FOLL_GET | FOLL_PIN))) + return true; + + /* We limit this check to the most egregious case - a long term pin. */ + if (!(gup_flags & FOLL_LONGTERM)) + return true; + + /* If the VMA requires dirty tracking then GUP will be problematic. */ + return vma_needs_dirty_tracking(vma); +} + static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags) { vm_flags_t vm_flags = vma->vm_flags; int write = (gup_flags & FOLL_WRITE); int foreign = (gup_flags & FOLL_REMOTE); + bool vma_anon = vma_is_anonymous(vma); if (vm_flags & (VM_IO | VM_PFNMAP)) return -EFAULT; - if (gup_flags & FOLL_ANON && !vma_is_anonymous(vma)) + if ((gup_flags & FOLL_ANON) && !vma_anon) return -EFAULT; if ((gup_flags & FOLL_LONGTERM) && vma_is_fsdax(vma)) @@ -978,6 +999,9 @@ static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags) return -EFAULT; if (write) { + if (!vma_anon && !can_write_file_mapping(vma, gup_flags)) + return -EFAULT; + if (!(vm_flags & VM_WRITE)) { if (!(gup_flags & FOLL_FORCE)) return -EFAULT; diff --git a/mm/mmap.c b/mm/mmap.c index 536bbb8fa0ae..aac638dd22cf 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -1475,6 +1475,32 @@ SYSCALL_DEFINE1(old_mmap, struct mmap_arg_struct __user *, arg) } #endif /* __ARCH_WANT_SYS_OLD_MMAP */ +/* Do VMA operations imply write notify is required? */ +static inline bool vm_ops_needs_writenotify( + const struct vm_operations_struct *vm_ops) +{ + return vm_ops && (vm_ops->page_mkwrite || vm_ops->pfn_mkwrite); +} + +/* + * Does this VMA require the underlying folios to have their dirty state + * tracked? + */ +bool vma_needs_dirty_tracking(struct vm_area_struct *vma) +{ + /* Does the filesystem need to be notified? */ + if (vm_ops_needs_writenotify(vma->vm_ops)) + return true; + + /* Specialty mapping? */ + if (vma->vm_flags & VM_PFNMAP) + return false; + + /* Can the mapping track the dirty pages? */ + return vma->vm_file && vma->vm_file->f_mapping && + mapping_can_writeback(vma->vm_file->f_mapping); +} + /* * Some shared mappings will want the pages marked read-only * to track write events. If so, we'll downgrade vm_page_prot @@ -1484,14 +1510,13 @@ SYSCALL_DEFINE1(old_mmap, struct mmap_arg_struct __user *, arg) int vma_wants_writenotify(struct vm_area_struct *vma, pgprot_t vm_page_prot) { vm_flags_t vm_flags = vma->vm_flags; - const struct vm_operations_struct *vm_ops = vma->vm_ops; /* If it was private or non-writable, the write bit is already clear */ if ((vm_flags & (VM_WRITE|VM_SHARED)) != ((VM_WRITE|VM_SHARED))) return 0; /* The backer wishes to know when pages are first written to? */ - if (vm_ops && (vm_ops->page_mkwrite || vm_ops->pfn_mkwrite)) + if (vm_ops_needs_writenotify(vma->vm_ops)) return 1; /* The open routine did something to the protections that pgprot_modify @@ -1511,13 +1536,7 @@ int vma_wants_writenotify(struct vm_area_struct *vma, pgprot_t vm_page_prot) if (userfaultfd_wp(vma)) return 1; - /* Specialty mapping? */ - if (vm_flags & VM_PFNMAP) - return 0; - - /* Can the mapping track the dirty pages? */ - return vma->vm_file && vma->vm_file->f_mapping && - mapping_can_writeback(vma->vm_file->f_mapping); + return vma_needs_dirty_tracking(vma); } /*
GUP does not correctly implement write-notify semantics, nor does it guarantee that the underlying pages are correctly dirtied, which could lead to a kernel oops or data corruption when writing to file-backed mappings. This is only relevant when the mappings are file-backed and the underlying file system requires folio dirty tracking. File systems which do not, such as shmem or hugetlb, are not at risk and therefore can be written to without issue. Unfortunately this limitation of GUP has been present for some time and requires future rework of the GUP API in order to provide correct write access to such mappings. In the meantime, we add a check for the most broken GUP case - FOLL_LONGTERM - which really under no circumstances can safely access dirty-tracked file mappings. As part of this change we separate out vma_needs_dirty_tracking() as a helper function to determine this, which is distinct from vma_wants_writenotify() which is specific to determining which PTE flags to set. Suggested-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com> --- v4: - Split out vma_needs_dirty_tracking() from vma_wants_writenotify() to reduce duplication and update to use this in the GUP check. Note that both separately check vm_ops_needs_writenotify() as the latter needs to test this before the vm_pgprot_modify() test, resulting in vma_wants_writenotify() checking this twice, however it is such a small check this should not be egregious. v3: - Rebased on latest mm-unstable as of 24th April 2023. - Explicitly check whether file system requires folio dirtying. Note that vma_wants_writenotify() could not be used directly as it is very much focused on determining if the PTE r/w should be set (e.g. assuming private mapping does not require it as already set, soft dirty considerations). - Tested code against shmem and hugetlb mappings - confirmed that these are not disallowed by the check. - Eliminate FOLL_ALLOW_BROKEN_FILE_MAPPING flag and instead perform check only for FOLL_LONGTERM pins. - As a result, limit check to internal GUP code. https://lore.kernel.org/all/23c19e27ef0745f6d3125976e047ee0da62569d4.1682406295.git.lstoakes@gmail.com/ v2: - Add accidentally excluded ptrace_access_vm() use of FOLL_ALLOW_BROKEN_FILE_MAPPING. - Tweak commit message. https://lore.kernel.org/all/c8ee7e02d3d4f50bb3e40855c53bda39eec85b7d.1682321768.git.lstoakes@gmail.com/ v1: https://lore.kernel.org/all/f86dc089b460c80805e321747b0898fd1efe93d7.1682168199.git.lstoakes@gmail.com/ include/linux/mm.h | 1 + mm/gup.c | 26 +++++++++++++++++++++++++- mm/mmap.c | 37 ++++++++++++++++++++++++++++--------- 3 files changed, 54 insertions(+), 10 deletions(-)