Message ID | 20211201193750.2097885-1-catalin.marinas@arm.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Catalin, On Wed, Dec 1, 2021 at 8:38 PM Catalin Marinas <catalin.marinas@arm.com> wrote: > > Hi, > > Following the discussions on the first series, > > https://lore.kernel.org/r/20211124192024.2408218-1-catalin.marinas@arm.com > > this new patchset aims to generalise the sub-page probing and introduce > a minimum size to the fault_in_*() functions. I called this 'v2' but I > can rebase it on top of v1 and keep v1 as a btrfs live-lock > back-portable fix. that's what I was actually expecting, an updated patch series that changes the btrfs code to keep track of the user-copy fault address, the corresponding changes to the fault_in functions to call the appropriate arch functions, and the arch functions that probe far enough from the fault address to prevent deadlocks. In this step, how far the arch functions need to probe depends on the fault windows of the user-copy functions. A next step (as a patch series on top) would be to make sure direct I/O also takes sub-page faults into account. That seems to require probing the entire address range before the actual copying. A concern I have about this is time-of-check versus time-of-use: what if sub-page faults are added after the probing but before the copying? Other than that, an approach like adding min_size parameters might work except that maybe we can find a better name. Also, in order not to make things even more messy, the fault_in functions should probably continue to report how much of the address range they've failed to fault in. Callers can then check for themselves whether the function could fault in min_size bytes or not. > The fault_in_*() API improvements would be a new > series. Anyway, I'd first like to know whether this is heading in the > right direction and whether it's worth adding min_size to all > fault_in_*() (more below). > > v2 adds a 'min_size' argument to all fault_in_*() functions with current > callers passing 0 (or we could make it 1). A probe_subpage_*() call is > made for the min_size range, though with all 0 this wouldn't have any > effect. The only difference is btrfs search_ioctl() in the last patch > which passes a non-zero min_size to avoid the live-lock (functionally > that's the same as the v1 series). In the btrfs case, the copying will already trigger sub-page faults; we only need to make sure that the next fault-in attempt happens at the fault address. (And that the fault_in functions take the user-copy fuzz into account, which we also need for byte granularity copying anyway.) Otherwise, we're creating the same time-of-check versus time-of-use disparity as for direct-IO here, unnecessarily. > In terms of sub-page probing, I don't think with the current kernel > anything other than search_ioctl() matters. The buffered file I/O can > already cope with current fault_in_*() + copy_*_user() loops (the > uaccess makes progress). Direct I/O either goes via GUP + kernel mapping > access (and memcpy() can't fault) or, if the user buffer is not PAGE > aligned, it may fall back to buffered I/O. So we really only care about > fault_in_writeable(), as in v1. Yes from a regression point of view, but note that direct I/O still circumvents the sub-page fault checking, which seems to defeat the whole point. > Linus suggested that we could use the min_size to request a minimum > guaranteed probed size (in most cases this would be 1) and put a cap on > the faulted-in size, say two pages. All the fault_in_iov_iter_*() > callers will need to check the actual quantity returned by fault_in_*() > rather than bail out on non-zero but Andreas has a patch already (though > I think there are a few cases in btrfs etc.): > > https://lore.kernel.org/r/20211123151812.361624-1-agruenba@redhat.com > > With these callers fixed, we could add something like the diff below. > But, again, min_size doesn't actually have any current use in the kernel > other than fault_in_writeable() and search_ioctl(). We're trying pretty hard to handle large I/O requests efficiently at the filesystem level. A small, static upper limit in the fault-in functions has the potential to ruin those efforts. So I'm not a fan of that. > Thanks for having a look. Suggestions welcomed. > > ------------------8<------------------------------- > diff --git a/mm/gup.c b/mm/gup.c > index 7fa69b0fb859..3aa88aa8ce9d 100644 > --- a/mm/gup.c > +++ b/mm/gup.c > @@ -1658,6 +1658,8 @@ static long __get_user_pages_locked(struct mm_struct *mm, unsigned long start, > } > #endif /* !CONFIG_MMU */ > > +#define MAX_FAULT_IN_SIZE (2 * PAGE_SIZE) > + > /** > * fault_in_writeable - fault in userspace address range for writing > * @uaddr: start of address range > @@ -1671,6 +1673,7 @@ size_t fault_in_writeable(char __user *uaddr, size_t size, size_t min_size) > { > char __user *start = uaddr, *end; > size_t faulted_in = size; > + size_t max_size = max_t(size_t, MAX_FAULT_IN_SIZE, min_size); > > if (unlikely(size == 0)) > return 0; > @@ -1679,7 +1682,7 @@ size_t fault_in_writeable(char __user *uaddr, size_t size, size_t min_size) > return size; > uaddr = (char __user *)PAGE_ALIGN((unsigned long)uaddr); > } > - end = (char __user *)PAGE_ALIGN((unsigned long)start + size); > + end = (char __user *)PAGE_ALIGN((unsigned long)start + max_size); > if (unlikely(end < start)) > end = NULL; > while (uaddr != end) { > @@ -1726,9 +1729,10 @@ size_t fault_in_safe_writeable(const char __user *uaddr, size_t size, > struct vm_area_struct *vma = NULL; > int locked = 0; > size_t faulted_in = size; > + size_t max_size = max_t(size_t, MAX_FAULT_IN_SIZE, min_size); > > nstart = start & PAGE_MASK; > - end = PAGE_ALIGN(start + size); > + end = PAGE_ALIGN(start + max_size); > if (end < nstart) > end = 0; > for (; nstart != end; nstart = nend) { > @@ -1759,7 +1763,7 @@ size_t fault_in_safe_writeable(const char __user *uaddr, size_t size, > if (locked) > mmap_read_unlock(mm); > if (nstart != end) > - faulted_in = min_t(size_t, nstart - start, size); > + faulted_in = min_t(size_t, nstart - start, max_size); > if (faulted_in < min_size || > (min_size && probe_subpage_safe_writeable(uaddr, min_size))) > return size; > @@ -1782,6 +1786,7 @@ size_t fault_in_readable(const char __user *uaddr, size_t size, > const char __user *start = uaddr, *end; > volatile char c; > size_t faulted_in = size; > + size_t max_size = max_t(size_t, MAX_FAULT_IN_SIZE, min_size); > > if (unlikely(size == 0)) > return 0; > @@ -1790,7 +1795,7 @@ size_t fault_in_readable(const char __user *uaddr, size_t size, > return size; > uaddr = (const char __user *)PAGE_ALIGN((unsigned long)uaddr); > } > - end = (const char __user *)PAGE_ALIGN((unsigned long)start + size); > + end = (const char __user *)PAGE_ALIGN((unsigned long)start + max_size); > if (unlikely(end < start)) > end = NULL; > while (uaddr != end) { > ------------------8<------------------------------- > > Catalin Marinas (4): > mm: Introduce a 'min_size' argument to fault_in_*() > mm: Probe for sub-page faults in fault_in_*() > arm64: Add support for user sub-page fault probing > btrfs: Avoid live-lock in search_ioctl() on hardware with sub-page > faults > > arch/Kconfig | 7 ++++ > arch/arm64/Kconfig | 1 + > arch/arm64/include/asm/uaccess.h | 59 +++++++++++++++++++++++++++++ > arch/powerpc/kernel/kvm.c | 2 +- > arch/powerpc/kernel/signal_32.c | 4 +- > arch/powerpc/kernel/signal_64.c | 2 +- > arch/x86/kernel/fpu/signal.c | 2 +- > drivers/gpu/drm/armada/armada_gem.c | 2 +- > fs/btrfs/file.c | 6 +-- > fs/btrfs/ioctl.c | 7 +++- > fs/f2fs/file.c | 2 +- > fs/fuse/file.c | 2 +- > fs/gfs2/file.c | 8 ++-- > fs/iomap/buffered-io.c | 2 +- > fs/ntfs/file.c | 2 +- > fs/ntfs3/file.c | 2 +- > include/linux/pagemap.h | 8 ++-- > include/linux/uaccess.h | 53 ++++++++++++++++++++++++++ > include/linux/uio.h | 6 ++- > lib/iov_iter.c | 28 +++++++++++--- > mm/filemap.c | 2 +- > mm/gup.c | 37 +++++++++++++----- > 22 files changed, 203 insertions(+), 41 deletions(-) > Thanks, Andreas
On Fri, Dec 3, 2021 at 7:29 AM Andreas Gruenbacher <agruenba@redhat.com> wrote: > > > We're trying pretty hard to handle large I/O requests efficiently at > the filesystem level. A small, static upper limit in the fault-in > functions has the potential to ruin those efforts. So I'm not a fan of > that. I don't think fault-in should happen under any sane normal circumstances. Except for low-memory situations, and then you don't want to fault in large areas. Do you really expect to write big areas that the user has never even touched? That would be literally insane. And if the user _has_ touched them, then they'll in in-core. Except for the "swapped out" case. End result: this is purely a correctness issue, not a performance issue. Linus
On Fri, Dec 3, 2021 at 6:58 PM Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Fri, Dec 3, 2021 at 7:29 AM Andreas Gruenbacher <agruenba@redhat.com> wrote: > > We're trying pretty hard to handle large I/O requests efficiently at > > the filesystem level. A small, static upper limit in the fault-in > > functions has the potential to ruin those efforts. So I'm not a fan of > > that. > > I don't think fault-in should happen under any sane normal circumstances. > > Except for low-memory situations, and then you don't want to fault in > large areas. > > Do you really expect to write big areas that the user has never even > touched? That would be literally insane. > > And if the user _has_ touched them, then they'll in in-core. Except > for the "swapped out" case. > > End result: this is purely a correctness issue, not a performance issue. It happens when you mmap a file and write the mmapped region to another file, for example. I don't think we want to make filesystems go bonkers in such scenarios. Scaling down in response to memory pressure sounds perfectly fine though. Thanks, Andreas
On Fri, Dec 3, 2021 at 10:12 AM Andreas Gruenbacher <agruenba@redhat.com> wrote: > > It happens when you mmap a file and write the mmapped region to > another file, for example. Do you actually have such loads? Nobody should use mmap() for single-access file copy purposes. It's slower than just doing the copy exactly due to page fault overhead. In other words, you seem to be worrying about the performance of a load that is _explicitly_ badly written. You should be fixing the application, not making the kernel do stupid things. Also, it's worth noting that that situation should be caught by the page-in code, which will map multiple pages in one go (do_fault_around() - for when the pages are cached), and do the readahead logic (filemap_fault() - for when the pages aren't in the page cache). Both of which are a lot more important than the "synchronously fault in pages one at a time". Linus
Hi Andreas, On Fri, Dec 03, 2021 at 04:29:18PM +0100, Andreas Gruenbacher wrote: > On Wed, Dec 1, 2021 at 8:38 PM Catalin Marinas <catalin.marinas@arm.com> wrote: > > Following the discussions on the first series, > > > > https://lore.kernel.org/r/20211124192024.2408218-1-catalin.marinas@arm.com > > > > this new patchset aims to generalise the sub-page probing and introduce > > a minimum size to the fault_in_*() functions. I called this 'v2' but I > > can rebase it on top of v1 and keep v1 as a btrfs live-lock > > back-portable fix. > > that's what I was actually expecting, an updated patch series that > changes the btrfs code to keep track of the user-copy fault address, > the corresponding changes to the fault_in functions to call the > appropriate arch functions, and the arch functions that probe far > enough from the fault address to prevent deadlocks. In this step, how > far the arch functions need to probe depends on the fault windows of > the user-copy functions. I have that series as well, see the top patch here (well, you've seen it already): https://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git/log/?h=devel/btrfs-live-lock-fix But I'm not convinced it's worth it if we go for the approach in v2 here. A key difference between v2 and the above branch is that probe_subpage_writeable() checks exactly what is given (min_size) in v2 while in the devel/btrfs-live-lock-fix branch it can be given a PAGE_SIZE or more but only checks the beginning 16 bytes to cover the copy_to_user() error margin. The latter assumes that the caller will always attempt the fault_in() from where the uaccess failed rather than relying on the fault_in() itself to avoid the live-lock. v1 posted earlier also checks the full range but only in fault_in_writeable() which seems to be only relevant for btrfs in the arm64 case. Maybe I should post the other series as an alternative, get some input on it. > A next step (as a patch series on top) would be to make sure direct > I/O also takes sub-page faults into account. That seems to require > probing the entire address range before the actual copying. A concern > I have about this is time-of-check versus time-of-use: what if > sub-page faults are added after the probing but before the copying? With direct I/O (that doesn't fall back to buffered), the access is done via the kernel mapping following a get_user_pages(). Since the access here cannot cope with exceptions, it must be unchecked. Yes, we do have the time-of-use vs check problem but I'm not worried. I regard MTE as a probabilistic security feature. Things get even murkier if the I/O is done by some DMA engine which ignores tags anyway. CHERI, OTOH, is a lot more strict but there is no check vs use issue here since all permissions are encoded in the pointer itself (we might just expand access_ok() to take this into account). We could use the min_size logic for this I think in functions like gfs2_file_direct_read() you'd have to fault in first and than invoke iomap_dio_rw(). Anyway, like I said before, I'd leave the MTE accesses for direct I/O unchecked as they currently are, I don't think it's worth the effort and the potential slow-down (it will be significant). > Other than that, an approach like adding min_size parameters might > work except that maybe we can find a better name. Also, in order not > to make things even more messy, the fault_in functions should probably > continue to report how much of the address range they've failed to > fault in. Callers can then check for themselves whether the function > could fault in min_size bytes or not. That's fine as well. I did it this way because I found the logic easier to write. > > The fault_in_*() API improvements would be a new > > series. Anyway, I'd first like to know whether this is heading in the > > right direction and whether it's worth adding min_size to all > > fault_in_*() (more below). > > > > v2 adds a 'min_size' argument to all fault_in_*() functions with current > > callers passing 0 (or we could make it 1). A probe_subpage_*() call is > > made for the min_size range, though with all 0 this wouldn't have any > > effect. The only difference is btrfs search_ioctl() in the last patch > > which passes a non-zero min_size to avoid the live-lock (functionally > > that's the same as the v1 series). > > In the btrfs case, the copying will already trigger sub-page faults; > we only need to make sure that the next fault-in attempt happens at > the fault address. (And that the fault_in functions take the user-copy > fuzz into account, which we also need for byte granularity copying > anyway.) Otherwise, we're creating the same time-of-check versus > time-of-use disparity as for direct-IO here, unnecessarily. I don't think it matters for btrfs. In some way, you'd have the time of check vs use problem even if you fault in from where uaccess failed. It's just that in practice it's impossible to live-lock as it needs very precise synchronisation to change the tags from another CPU. But you do guarantee that the uaccess was correct. > > In terms of sub-page probing, I don't think with the current kernel > > anything other than search_ioctl() matters. The buffered file I/O can > > already cope with current fault_in_*() + copy_*_user() loops (the > > uaccess makes progress). Direct I/O either goes via GUP + kernel mapping > > access (and memcpy() can't fault) or, if the user buffer is not PAGE > > aligned, it may fall back to buffered I/O. So we really only care about > > fault_in_writeable(), as in v1. > > Yes from a regression point of view, but note that direct I/O still > circumvents the sub-page fault checking, which seems to defeat the > whole point. It doesn't entirely defeat it. From my perspective MTE is more of a best effort to find use-after-free etc. bugs. It has a performance penalty and I wouldn't want to make it worse. Some libc allocators even go for untagged memory (unchecked) if the required size is over some threshold (usually when it falls back to multiple page allocations). That's more likely to be involved in direct I/O anyway, so the additional check in fault_in() won't matter. > > Linus suggested that we could use the min_size to request a minimum > > guaranteed probed size (in most cases this would be 1) and put a cap on > > the faulted-in size, say two pages. All the fault_in_iov_iter_*() > > callers will need to check the actual quantity returned by fault_in_*() > > rather than bail out on non-zero but Andreas has a patch already (though > > I think there are a few cases in btrfs etc.): > > > > https://lore.kernel.org/r/20211123151812.361624-1-agruenba@redhat.com > > > > With these callers fixed, we could add something like the diff below. > > But, again, min_size doesn't actually have any current use in the kernel > > other than fault_in_writeable() and search_ioctl(). > > We're trying pretty hard to handle large I/O requests efficiently at > the filesystem level. A small, static upper limit in the fault-in > functions has the potential to ruin those efforts. So I'm not a fan of > that. I can't comment on this, I haven't spent time in the fs land. But I did notice that generic_perform_write() for example limits the fault_in() to PAGE_SIZE. So this min_size potential optimisation wouldn't make any difference.
diff --git a/mm/gup.c b/mm/gup.c index 7fa69b0fb859..3aa88aa8ce9d 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -1658,6 +1658,8 @@ static long __get_user_pages_locked(struct mm_struct *mm, unsigned long start, } #endif /* !CONFIG_MMU */ +#define MAX_FAULT_IN_SIZE (2 * PAGE_SIZE) + /** * fault_in_writeable - fault in userspace address range for writing * @uaddr: start of address range @@ -1671,6 +1673,7 @@ size_t fault_in_writeable(char __user *uaddr, size_t size, size_t min_size) { char __user *start = uaddr, *end; size_t faulted_in = size; + size_t max_size = max_t(size_t, MAX_FAULT_IN_SIZE, min_size); if (unlikely(size == 0)) return 0; @@ -1679,7 +1682,7 @@ size_t fault_in_writeable(char __user *uaddr, size_t size, size_t min_size) return size; uaddr = (char __user *)PAGE_ALIGN((unsigned long)uaddr); } - end = (char __user *)PAGE_ALIGN((unsigned long)start + size); + end = (char __user *)PAGE_ALIGN((unsigned long)start + max_size); if (unlikely(end < start)) end = NULL; while (uaddr != end) { @@ -1726,9 +1729,10 @@ size_t fault_in_safe_writeable(const char __user *uaddr, size_t size, struct vm_area_struct *vma = NULL; int locked = 0; size_t faulted_in = size; + size_t max_size = max_t(size_t, MAX_FAULT_IN_SIZE, min_size); nstart = start & PAGE_MASK; - end = PAGE_ALIGN(start + size); + end = PAGE_ALIGN(start + max_size); if (end < nstart) end = 0; for (; nstart != end; nstart = nend) { @@ -1759,7 +1763,7 @@ size_t fault_in_safe_writeable(const char __user *uaddr, size_t size, if (locked) mmap_read_unlock(mm); if (nstart != end) - faulted_in = min_t(size_t, nstart - start, size); + faulted_in = min_t(size_t, nstart - start, max_size); if (faulted_in < min_size || (min_size && probe_subpage_safe_writeable(uaddr, min_size))) return size; @@ -1782,6 +1786,7 @@ size_t fault_in_readable(const char __user *uaddr, size_t size, const char __user *start = uaddr, *end; volatile char c; size_t faulted_in = size; + size_t max_size = max_t(size_t, MAX_FAULT_IN_SIZE, min_size); if (unlikely(size == 0)) return 0; @@ -1790,7 +1795,7 @@ size_t fault_in_readable(const char __user *uaddr, size_t size, return size; uaddr = (const char __user *)PAGE_ALIGN((unsigned long)uaddr); } - end = (const char __user *)PAGE_ALIGN((unsigned long)start + size); + end = (const char __user *)PAGE_ALIGN((unsigned long)start + max_size); if (unlikely(end < start)) end = NULL; while (uaddr != end) {