Message ID | cover.1683235180.git.lstoakes@gmail.com
---|---
Series | mm/gup: disallow GUP writing to file-backed mappings by default
On 04.05.23 23:27, Lorenzo Stoakes wrote:
> Writing to file-backed mappings which require folio dirty tracking using
> GUP is a fundamentally broken operation, as kernel write access to GUP
> mappings do not adhere to the semantics expected by a file system.
>
> A GUP caller uses the direct mapping to access the folio, which does not
> cause write notify to trigger, nor does it enforce that the caller marks
> the folio dirty.
>
> The problem arises when, after an initial write to the folio, writeback
> results in the folio being cleaned and then the caller, via the GUP
> interface, writes to the folio again.
>
> As a result of the use of this secondary, direct, mapping to the folio no
> write notify will occur, and if the caller does mark the folio dirty, this
> will be done so unexpectedly.
>
> For example, consider the following scenario:-
>
> 1. A folio is written to via GUP which write-faults the memory, notifying
>    the file system and dirtying the folio.
> 2. Later, writeback is triggered, resulting in the folio being cleaned and
>    the PTE being marked read-only.
> 3. The GUP caller writes to the folio, as it is mapped read/write via the
>    direct mapping.
> 4. The GUP caller, now done with the page, unpins it and sets it dirty
>    (though it does not have to).
>
> This change updates both the PUP FOLL_LONGTERM slow and fast APIs. As
> pin_user_pages_fast_only() does not exist, we can rely on a slightly
> imperfect whitelisting in the PUP-fast case and fall back to the slow case
> should this fail.

Thanks a lot, this looks pretty good to me!

I started writing some selftests (assuming none would be in the works) using
io_uring and the gup_test interface. So far, no real surprises for the
general GUP interaction [1].

There are two things I noticed when registering an io_uring fixed buffer
(that differ now from generic gup_test usage):

(1) Registering a fixed buffer targeting an unsupported MAP_SHARED FS file
    now fails with EFAULT (from pin_user_pages()) instead of EOPNOTSUPP
    (from io_pin_pages()).

    The man page for io_uring_register documents:

        EOPNOTSUPP
            User buffers point to file-backed memory.

    ... we'd have to do some kind of errno translation in io_pin_pages().
    But the translation is not simple (sometimes we want to forward
    EOPNOTSUPP). That also applies once we remove that special-casing in
    io_uring code.

    ... maybe we can simply update the man page (stating that older kernels
    returned EOPNOTSUPP) and start returning EFAULT?

(2) Registering a fixed buffer targeting a MAP_PRIVATE FS file fails with
    EOPNOTSUPP (from io_pin_pages()). As discussed, there is nothing wrong
    with pinning all-anon pages (resulting from breaking COW).

    That could easily be handled (allow any !VM_MAYSHARE), and would
    automatically be handled once the io_uring special-casing is removed.

[1]

# ./pin_longterm
# [INFO] detected hugetlb size: 2048 KiB
# [INFO] detected hugetlb size: 1048576 KiB
TAP version 13
1..50
# [RUN] R/W longterm GUP pin in MAP_SHARED file mapping ... with memfd
ok 1 Pinning succeeded as expected
# [RUN] R/W longterm GUP pin in MAP_SHARED file mapping ... with tmpfile
ok 2 Pinning succeeded as expected
# [RUN] R/W longterm GUP pin in MAP_SHARED file mapping ... with local tmpfile
ok 3 Pinning failed as expected
# [RUN] R/W longterm GUP pin in MAP_SHARED file mapping ... with memfd hugetlb (2048 kB)
ok 4 # SKIP need more free huge pages
# [RUN] R/W longterm GUP pin in MAP_SHARED file mapping ... with memfd hugetlb (1048576 kB)
ok 5 Pinning succeeded as expected
# [RUN] R/W longterm GUP-fast pin in MAP_SHARED file mapping ... with memfd
ok 6 Pinning succeeded as expected
# [RUN] R/W longterm GUP-fast pin in MAP_SHARED file mapping ... with tmpfile
ok 7 Pinning succeeded as expected
# [RUN] R/W longterm GUP-fast pin in MAP_SHARED file mapping ... with local tmpfile
ok 8 Pinning failed as expected
# [RUN] R/W longterm GUP-fast pin in MAP_SHARED file mapping ... with memfd hugetlb (2048 kB)
ok 9 # SKIP need more free huge pages
# [RUN] R/W longterm GUP-fast pin in MAP_SHARED file mapping ... with memfd hugetlb (1048576 kB)
ok 10 Pinning succeeded as expected
# [RUN] R/O longterm GUP pin in MAP_SHARED file mapping ... with memfd
ok 11 Pinning succeeded as expected
# [RUN] R/O longterm GUP pin in MAP_SHARED file mapping ... with tmpfile
ok 12 Pinning succeeded as expected
# [RUN] R/O longterm GUP pin in MAP_SHARED file mapping ... with local tmpfile
ok 13 Pinning succeeded as expected
# [RUN] R/O longterm GUP pin in MAP_SHARED file mapping ... with memfd hugetlb (2048 kB)
ok 14 # SKIP need more free huge pages
# [RUN] R/O longterm GUP pin in MAP_SHARED file mapping ... with memfd hugetlb (1048576 kB)
ok 15 Pinning succeeded as expected
# [RUN] R/O longterm GUP-fast pin in MAP_SHARED file mapping ... with memfd
ok 16 Pinning succeeded as expected
# [RUN] R/O longterm GUP-fast pin in MAP_SHARED file mapping ... with tmpfile
ok 17 Pinning succeeded as expected
# [RUN] R/O longterm GUP-fast pin in MAP_SHARED file mapping ... with local tmpfile
ok 18 Pinning succeeded as expected
# [RUN] R/O longterm GUP-fast pin in MAP_SHARED file mapping ... with memfd hugetlb (2048 kB)
ok 19 # SKIP need more free huge pages
# [RUN] R/O longterm GUP-fast pin in MAP_SHARED file mapping ... with memfd hugetlb (1048576 kB)
ok 20 Pinning succeeded as expected
# [RUN] R/W longterm GUP pin in MAP_PRIVATE file mapping ... with memfd
ok 21 Pinning succeeded as expected
# [RUN] R/W longterm GUP pin in MAP_PRIVATE file mapping ... with tmpfile
ok 22 Pinning succeeded as expected
# [RUN] R/W longterm GUP pin in MAP_PRIVATE file mapping ... with local tmpfile
ok 23 Pinning succeeded as expected
# [RUN] R/W longterm GUP pin in MAP_PRIVATE file mapping ... with memfd hugetlb (2048 kB)
ok 24 # SKIP need more free huge pages
# [RUN] R/W longterm GUP pin in MAP_PRIVATE file mapping ... with memfd hugetlb (1048576 kB)
ok 25 Pinning succeeded as expected
# [RUN] R/W longterm GUP-fast pin in MAP_PRIVATE file mapping ... with memfd
ok 26 Pinning succeeded as expected
# [RUN] R/W longterm GUP-fast pin in MAP_PRIVATE file mapping ... with tmpfile
ok 27 Pinning succeeded as expected
# [RUN] R/W longterm GUP-fast pin in MAP_PRIVATE file mapping ... with local tmpfile
ok 28 Pinning succeeded as expected
# [RUN] R/W longterm GUP-fast pin in MAP_PRIVATE file mapping ... with memfd hugetlb (2048 kB)
ok 29 # SKIP need more free huge pages
# [RUN] R/W longterm GUP-fast pin in MAP_PRIVATE file mapping ... with memfd hugetlb (1048576 kB)
ok 30 Pinning succeeded as expected
# [RUN] R/O longterm GUP pin in MAP_PRIVATE file mapping ... with memfd
ok 31 Pinning succeeded as expected
# [RUN] R/O longterm GUP pin in MAP_PRIVATE file mapping ... with tmpfile
ok 32 Pinning succeeded as expected
# [RUN] R/O longterm GUP pin in MAP_PRIVATE file mapping ... with local tmpfile
ok 33 Pinning succeeded as expected
# [RUN] R/O longterm GUP pin in MAP_PRIVATE file mapping ... with memfd hugetlb (2048 kB)
ok 34 # SKIP need more free huge pages
# [RUN] R/O longterm GUP pin in MAP_PRIVATE file mapping ... with memfd hugetlb (1048576 kB)
ok 35 Pinning succeeded as expected
# [RUN] R/O longterm GUP-fast pin in MAP_PRIVATE file mapping ... with memfd
ok 36 Pinning succeeded as expected
# [RUN] R/O longterm GUP-fast pin in MAP_PRIVATE file mapping ... with tmpfile
ok 37 Pinning succeeded as expected
# [RUN] R/O longterm GUP-fast pin in MAP_PRIVATE file mapping ... with local tmpfile
ok 38 Pinning succeeded as expected
# [RUN] R/O longterm GUP-fast pin in MAP_PRIVATE file mapping ... with memfd hugetlb (2048 kB)
ok 39 # SKIP need more free huge pages
# [RUN] R/O longterm GUP-fast pin in MAP_PRIVATE file mapping ... with memfd hugetlb (1048576 kB)
ok 40 Pinning succeeded as expected
# [RUN] iouring fixed buffer with MAP_SHARED file mapping ... with memfd
ok 41 Pinning succeeded as expected
# [RUN] iouring fixed buffer with MAP_SHARED file mapping ... with tmpfile
ok 42 Pinning succeeded as expected
# [RUN] iouring fixed buffer with MAP_SHARED file mapping ... with local tmpfile
ok 43 Pinning failed as expected
# [RUN] iouring fixed buffer with MAP_SHARED file mapping ... with memfd hugetlb (2048 kB)
ok 44 # SKIP need more free huge pages
# [RUN] iouring fixed buffer with MAP_SHARED file mapping ... with memfd hugetlb (1048576 kB)
ok 45 Pinning succeeded as expected
# [RUN] iouring fixed buffer with MAP_PRIVATE file mapping ... with memfd
ok 46 Pinning succeeded as expected
# [RUN] iouring fixed buffer with MAP_PRIVATE file mapping ... with tmpfile
ok 47 Pinning succeeded as expected
# [RUN] iouring fixed buffer with MAP_PRIVATE file mapping ... with local tmpfile
not ok 48 Pinning failed as expected
# [RUN] iouring fixed buffer with MAP_PRIVATE file mapping ... with memfd hugetlb (2048 kB)
ok 49 # SKIP need more free huge pages
# [RUN] iouring fixed buffer with MAP_PRIVATE file mapping ... with memfd hugetlb (1048576 kB)
ok 50 Pinning succeeded as expected
Bail out! 1 out of 50 tests failed
# Totals: pass:39 fail:1 xfail:0 xpass:0 skip:10 error:0
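As a rough illustration of point (1) above, a minimal userspace sketch (not
the selftest referenced in this thread) of registering an io_uring fixed
buffer backed by a MAP_SHARED file mapping and inspecting the resulting errno
might look as follows. It assumes liburing is installed; "testfile" is an
arbitrary name on the filesystem under test.

/*
 * Sketch only: register a fixed buffer backed by a MAP_SHARED file page
 * and print the result of the registration.
 */
#include <fcntl.h>
#include <liburing.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/uio.h>
#include <unistd.h>

int main(void)
{
	struct io_uring ring;
	struct iovec iov;
	int fd, ret;

	fd = open("testfile", O_RDWR | O_CREAT, 0600);
	if (fd < 0 || ftruncate(fd, 4096))
		return 1;

	iov.iov_base = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
			    MAP_SHARED, fd, 0);
	iov.iov_len = 4096;
	if (iov.iov_base == MAP_FAILED)
		return 1;

	if (io_uring_queue_init(8, &ring, 0))
		return 1;

	ret = io_uring_register_buffers(&ring, &iov, 1);
	/*
	 * Historically io_pin_pages() rejected file-backed memory with
	 * -EOPNOTSUPP; with the GUP-level check (and the io_uring
	 * special-casing removed) the failure instead comes back as
	 * -EFAULT from pin_user_pages().
	 */
	printf("io_uring_register_buffers: %d (%s)\n", ret,
	       ret < 0 ? strerror(-ret) : "success");

	io_uring_queue_exit(&ring);
	return 0;
}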
On Fri, May 05, 2023 at 10:21:21PM +0200, David Hildenbrand wrote:
> On 04.05.23 23:27, Lorenzo Stoakes wrote:
> > Writing to file-backed mappings which require folio dirty tracking using
> > GUP is a fundamentally broken operation, as kernel write access to GUP
> > mappings do not adhere to the semantics expected by a file system.
> >
> > [snip]
>
> Thanks a lot, this looks pretty good to me!

Thanks!

> I started writing some selftests (assuming none would be in the works)
> using io_uring and the gup_test interface. So far, no real surprises for
> the general GUP interaction [1].

Nice! I was using the cow selftests, as I was just looking for something
that touches FOLL_LONGTERM with PUP-fast; I hacked it so it always wrote
just to test patches, but clearly we need something more thorough.

> There are two things I noticed when registering an io_uring fixed buffer
> (that differ now from generic gup_test usage):
>
> (1) Registering a fixed buffer targeting an unsupported MAP_SHARED FS file
>     now fails with EFAULT (from pin_user_pages()) instead of EOPNOTSUPP
>     (from io_pin_pages()).
>
>     The man page for io_uring_register documents:
>
>         EOPNOTSUPP
>             User buffers point to file-backed memory.
>
>     ... we'd have to do some kind of errno translation in io_pin_pages().
>     But the translation is not simple (sometimes we want to forward
>     EOPNOTSUPP). That also applies once we remove that special-casing in
>     io_uring code.
>
>     ... maybe we can simply update the man page (stating that older kernels
>     returned EOPNOTSUPP) and start returning EFAULT?

Yeah, I noticed this discrepancy when going through initial attempts to
refactor in the vmas patch series; I wonder how important it is to
differentiate? I have a feeling it probably doesn't matter too much, but
obviously we need input from Jens and Pavel.

> (2) Registering a fixed buffer targeting a MAP_PRIVATE FS file fails with
>     EOPNOTSUPP (from io_pin_pages()). As discussed, there is nothing wrong
>     with pinning all-anon pages (resulting from breaking COW).
>
>     That could easily be handled (allow any !VM_MAYSHARE), and would
>     automatically be handled once the io_uring special-casing is removed.

The entire intent of this series (for me :)) was to allow io_uring to just
drop this code altogether so we can unblock my 'drop the vmas parameter from
GUP' series [1]. I always intended to respin that after this settled down;
Jens and Pavel seemed on board with this (and really they shouldn't need to
be doing that check, that was always a failing in GUP).

I will do a v5 of this soon.

[1]: https://lore.kernel.org/all/cover.1681831798.git.lstoakes@gmail.com/

> [1]
>
> # ./pin_longterm
> [snip test output, quoted in full in the previous message]
>
> --
> Thanks,
>
> David / dhildenb
On Thu, May 04, 2023 at 10:27:50PM +0100, Lorenzo Stoakes wrote:
> Writing to file-backed mappings which require folio dirty tracking using
> GUP is a fundamentally broken operation, as kernel write access to GUP
> mappings do not adhere to the semantics expected by a file system.
>
> [snip]

As discussed at LSF/MM, on the flight over I wrote a little repro [0] which
reliably triggers the ext4 warning by recreating the scenario described
above, using a small userland program and kernel module.

This code is not perfect (plane code :) but does seem to do the job
adequately, also obviously this should only be run in a VM environment
where data loss is acceptable (in my case a small qemu instance).

Hopefully this is useful in some way. Note that I explicitly use
pin_user_pages() without FOLL_LONGTERM here in order to not run into the
mitigation this very patch series provides! Obviously if you revert this
series you can see the same happening with FOLL_LONGTERM set.

I have licensed the code as GPLv2 so anybody's free to do with it as they
will if it's useful in any way!

[0]: https://github.com/lorenzo-stoakes/gup-repro
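For readers who have not looked at [0], a heavily simplified, hypothetical
sketch of the kernel-module side of such a repro is shown below. It is not
the actual code from the repository, and it assumes the pin_user_pages()
signature current at the time of this thread (which still takes a trailing
vmas argument); the ioctl plumbing around it is omitted.

/*
 * Sketch: pin one user page without FOLL_LONGTERM, write to it through
 * the kernel mapping, then unpin it and mark it dirty.
 */
#include <linux/highmem.h>
#include <linux/mm.h>
#include <linux/sched/mm.h>

static int repro_write_pinned(unsigned long uaddr)
{
	struct page *page;
	void *kaddr;
	long ret;

	mmap_read_lock(current->mm);
	ret = pin_user_pages(uaddr, 1, FOLL_WRITE, &page, NULL);
	mmap_read_unlock(current->mm);
	if (ret != 1)
		return ret < 0 ? ret : -EFAULT;

	/*
	 * If writeback cleans the folio between the pin and this write,
	 * the write below bypasses write-notify entirely, which is the
	 * scenario described in the commit message.
	 */
	kaddr = kmap_local_page(page);
	memset(kaddr, 0x5a, PAGE_SIZE);
	kunmap_local(kaddr);

	/* Unpin and mark dirty (step 4 of the scenario above). */
	unpin_user_pages_dirty_lock(&page, 1, true);
	return 0;
}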
On Sun, May 14, 2023 at 08:20:04PM +0100, Lorenzo Stoakes wrote:
> As discussed at LSF/MM, on the flight over I wrote a little repro [0] which
> reliably triggers the ext4 warning by recreating the scenario described
> above, using a small userland program and kernel module.
>
> This code is not perfect (plane code :) but does seem to do the job
> adequately, also obviously this should only be run in a VM environment
> where data loss is acceptable (in my case a small qemu instance).

It would be really awesome if you could wire it up and submit it to
xfstests.
On Thu, May 04, 2023 at 10:27:50PM +0100, Lorenzo Stoakes wrote:
> Writing to file-backed mappings which require folio dirty tracking using
> GUP is a fundamentally broken operation, as kernel write access to GUP
> mappings do not adhere to the semantics expected by a file system.
>
> A GUP caller uses the direct mapping to access the folio, which does not
> cause write notify to trigger, nor does it enforce that the caller marks
> the folio dirty.

Okay, the problem is clear and the patchset looks good to me. But I'm
worried about breaking existing users.

Do we expect the change to be visible to real world users? If yes, are we
okay to break them?

One thing that came to mind is KVM with "qemu -object
memory-backend-file,share=on...". It is mostly used for pmem emulation.

Do we have plan B?

Just a random/crazy/broken idea:

- Allow folio_mkclean() (and folio_clear_dirty_for_io()) to fail, indicating
  that the page cannot be cleaned because it is pinned;

- Introduce a new vm_operations_struct::mkclean() that would be called by
  page_vma_mkclean_one() before clearing the range and can fail;

- On GUP, create an in-kernel fake VMA that represents the file, but with
  custom vm_ops. The VMA is registered in rmap to get notified on
  folio_mkclean() and fails it because of GUP.

- folio_clear_dirty_for_io() callers will handle the new failure as an
  indication that the page can be written back but will stay dirty, and
  fs-specific data that is associated with the page writeback cannot be
  freed.

I'm sure the idea is broken on many levels (I have never looked closely at
the writeback path). But maybe it is good enough as a conversation starter?
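To make the shape of this proposal concrete, here is a purely hypothetical,
self-contained sketch; none of these hooks exist in the kernel and the names
are invented for illustration only. The point is just the control flow: the
mkclean walk asks a per-VMA hook whether a range may be write-protected and
cleaned, and a GUP-owned "fake" VMA refuses while pages are pinned.

struct vm_area_struct;			/* stand-in for the real type */

struct mkclean_vm_ops {			/* imagined vm_operations_struct member */
	/* return 0 to allow cleaning, nonzero to refuse (e.g. pages pinned) */
	int (*mkclean)(struct vm_area_struct *vma,
		       unsigned long start, unsigned long end);
};

/* What page_vma_mkclean_one() might do before clearing write/dirty bits. */
static int try_mkclean(const struct mkclean_vm_ops *ops,
		       struct vm_area_struct *vma,
		       unsigned long start, unsigned long end)
{
	if (ops && ops->mkclean)
		return ops->mkclean(vma, start, end);
	return 0;
}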
On Mon, May 15, 2023 at 02:03:15PM +0300, Kirill A. Shutemov wrote:
> On Thu, May 04, 2023 at 10:27:50PM +0100, Lorenzo Stoakes wrote:
> > Writing to file-backed mappings which require folio dirty tracking using
> > GUP is a fundamentally broken operation, as kernel write access to GUP
> > mappings do not adhere to the semantics expected by a file system.
> >
> > A GUP caller uses the direct mapping to access the folio, which does not
> > cause write notify to trigger, nor does it enforce that the caller marks
> > the folio dirty.
>
> Okay, the problem is clear and the patchset looks good to me. But I'm
> worried about breaking existing users.
>
> Do we expect the change to be visible to real world users? If yes, are we
> okay to break them?

The general consensus at the moment is that there is no entirely reasonable
usage of this case and you're already running the risk of a kernel oops if
you do this, so it's already broken.

> One thing that came to mind is KVM with "qemu -object
> memory-backend-file,share=on...". It is mostly used for pmem emulation.
>
> Do we have plan B?

Yes, we can make it opt-in or opt-out via a FOLL_FLAG. This would be easy to
implement in the event of any issues arising.

> Just a random/crazy/broken idea:
>
> - Allow folio_mkclean() (and folio_clear_dirty_for_io()) to fail, indicating
>   that the page cannot be cleaned because it is pinned;
>
> - Introduce a new vm_operations_struct::mkclean() that would be called by
>   page_vma_mkclean_one() before clearing the range and can fail;
>
> - On GUP, create an in-kernel fake VMA that represents the file, but with
>   custom vm_ops. The VMA is registered in rmap to get notified on
>   folio_mkclean() and fails it because of GUP.
>
> - folio_clear_dirty_for_io() callers will handle the new failure as an
>   indication that the page can be written back but will stay dirty, and
>   fs-specific data that is associated with the page writeback cannot be
>   freed.
>
> I'm sure the idea is broken on many levels (I have never looked closely at
> the writeback path). But maybe it is good enough as a conversation starter?

Yeah, there are definitely a few ideas down this road that might be
possible. I am not sure how a filesystem could be expected to cope, or this
to be reasonably used, without dirty/writeback though, because you'll just
not track anything; or I guess you mean the mapping would be read-only but
somehow stay dirty?

I also had ideas along these lines of e.g. having a special vmalloc mode
which mimics the correct wrprotect settings + does the right thing, but of
course that does nothing to help DMA writing to a GUP-pinned page.

Though if the issue is at the point of the kernel marking the page dirty
unexpectedly, perhaps we can just invoke the mkwrite() _there_ before marking
dirty? There are probably some synchronisation issues there too.

Jason will have some thoughts on this I'm sure. I guess the key question
here is: is it actually feasible for this to work at all? Once we establish
that, the rest are details :)

> --
> Kiryl Shutsemau / Kirill A. Shutemov
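For illustration, here is a hedged sketch of where such an opt-in/opt-out
FOLL flag could sit relative to the check the cover letter describes. The
flag name FOLL_ALLOW_BROKEN_FILE_MAPPING and the helper name
writable_file_pin_allowed() are invented for this example and are not the
identifiers from the actual patches; the dirty-tracking helper is assumed to
be along the lines of what the series adds.

static bool writable_file_pin_allowed(struct vm_area_struct *vma,
				      unsigned int gup_flags)
{
	/* Only writable, long-term pins are being restricted. */
	if ((gup_flags & (FOLL_WRITE | FOLL_LONGTERM)) !=
	    (FOLL_WRITE | FOLL_LONGTERM))
		return true;

	/* Anonymous or non-shared mappings are fine (COW yields anon pages). */
	if (!vma->vm_file || !(vma->vm_flags & VM_MAYSHARE))
		return true;

	/* Hypothetical opt-out for callers that accept the broken semantics. */
	if (gup_flags & FOLL_ALLOW_BROKEN_FILE_MAPPING)
		return true;

	/* Shared file mappings that require dirty tracking are refused. */
	return !vma_needs_dirty_tracking(vma);
}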
On Sun, May 14, 2023 at 10:14:46PM -0700, Christoph Hellwig wrote:
> On Sun, May 14, 2023 at 08:20:04PM +0100, Lorenzo Stoakes wrote:
> > As discussed at LSF/MM, on the flight over I wrote a little repro [0] which
> > reliably triggers the ext4 warning by recreating the scenario described
> > above, using a small userland program and kernel module.
> >
> > This code is not perfect (plane code :) but does seem to do the job
> > adequately, also obviously this should only be run in a VM environment
> > where data loss is acceptable (in my case a small qemu instance).
>
> It would be really awesome if you could wire it up and submit it to
> xfstests.

Sure, am happy to take a look at that! Also happy if David finds it useful
in any way for his unit tests.

The kernel module interface is a bit sketchy (it takes a user address which
it blindly pins for you) so it's not something that should be run in any
unsafe environment, but as long as we are ok with that :)
On Mon, May 15, 2023 at 12:16:21PM +0100, Lorenzo Stoakes wrote:
> > One thing that came to mind is KVM with "qemu -object
> > memory-backend-file,share=on...". It is mostly used for pmem emulation.
> >
> > Do we have plan B?
>
> Yes, we can make it opt-in or opt-out via a FOLL_FLAG. This would be easy
> to implement in the event of any issues arising.

I'm becoming less keen on the idea of a per-subsystem opt out. I think we
should make a kernel wide opt out. I like the idea of using lower lockdown
levels. Lots of things become unavailable in the uAPI when the lockdown
level increases already.

> Jason will have some thoughts on this I'm sure. I guess the key question
> here is: is it actually feasible for this to work at all? Once we
> establish that, the rest are details :)

Surely it is, but like Ted said, the FS folks are not interested and they
are at least half the solution..

The FS also has to actively not write out the page while it cannot be
write protected unless it copies the data to a stable page. The block stack
needs the source data to be stable to do checksum/parity/etc stuff. It is a
complicated subject.

Jason
On Mon, May 15, 2023 at 09:12:49AM -0300, Jason Gunthorpe wrote:
> On Mon, May 15, 2023 at 12:16:21PM +0100, Lorenzo Stoakes wrote:
> > > One thing that came to mind is KVM with "qemu -object
> > > memory-backend-file,share=on...". It is mostly used for pmem emulation.
> > >
> > > Do we have plan B?
> >
> > Yes, we can make it opt-in or opt-out via a FOLL_FLAG. This would be easy
> > to implement in the event of any issues arising.
>
> I'm becoming less keen on the idea of a per-subsystem opt out. I think
> we should make a kernel wide opt out. I like the idea of using lower
> lockdown levels. Lots of things become unavailable in the uAPI when the
> lockdown level increases already.

This would be the 'safest' in the sense that a user can't be surprised by
higher lockdown = access modes disallowed; however, we'd _definitely_ need
to have an opt-in in that instance so io_uring can make use of this
regardless. That's easy to add, however.

If we do go down that road, we can be even stricter/vary what we do at
different levels, right?

> > Jason will have some thoughts on this I'm sure. I guess the key question
> > here is: is it actually feasible for this to work at all? Once we
> > establish that, the rest are details :)
>
> Surely it is, but like Ted said, the FS folks are not interested and
> they are at least half the solution..

:'(

> The FS also has to actively not write out the page while it cannot be
> write protected unless it copies the data to a stable page. The block
> stack needs the source data to be stable to do checksum/parity/etc
> stuff. It is a complicated subject.

Yes, my sense was that being able to write arbitrarily to these pages _at
all_ was a big issue, not only the dirty tracking aspect.

I guess at some level letting filesystems have such total flexibility as to
how they implement things leaves us in a difficult position.

> Jason
On Mon 15-05-23 14:07:57, Lorenzo Stoakes wrote:
> On Mon, May 15, 2023 at 09:12:49AM -0300, Jason Gunthorpe wrote:
> > On Mon, May 15, 2023 at 12:16:21PM +0100, Lorenzo Stoakes wrote:
> > > Jason will have some thoughts on this I'm sure. I guess the key question
> > > here is: is it actually feasible for this to work at all? Once we
> > > establish that, the rest are details :)
> >
> > Surely it is, but like Ted said, the FS folks are not interested and
> > they are at least half the solution..
>
> :'(

Well, I'd phrase this a bit differently: it is a difficult sell to fs
maintainers that they should significantly complicate writeback code / VFS
with bounce page handling etc. for a thing that is a not-much-used corner
case. So if we can get away with forbidding long-term pins, then that's the
easiest solution. Dealing with short-term pins is easier as we can just wait
for unpinning, which is implementable in a localized manner.

> > The FS also has to actively not write out the page while it cannot be
> > write protected unless it copies the data to a stable page. The block
> > stack needs the source data to be stable to do checksum/parity/etc
> > stuff. It is a complicated subject.
>
> Yes, my sense was that being able to write arbitrarily to these pages _at
> all_ was a big issue, not only the dirty tracking aspect.

Yes.

> I guess at some level letting filesystems have such total flexibility as to
> how they implement things leaves us in a difficult position.

I'm not sure what you mean by "total flexibility" here. In my opinion it is
also about how HW performs checksumming etc.

								Honza
On Wed, May 17, 2023 at 09:29:20AM +0200, Jan Kara wrote:
> On Mon 15-05-23 14:07:57, Lorenzo Stoakes wrote:
> > On Mon, May 15, 2023 at 09:12:49AM -0300, Jason Gunthorpe wrote:
> > > On Mon, May 15, 2023 at 12:16:21PM +0100, Lorenzo Stoakes wrote:
> > > > Jason will have some thoughts on this I'm sure. I guess the key
> > > > question here is: is it actually feasible for this to work at all?
> > > > Once we establish that, the rest are details :)
> > >
> > > Surely it is, but like Ted said, the FS folks are not interested and
> > > they are at least half the solution..
> >
> > :'(
>
> Well, I'd phrase this a bit differently: it is a difficult sell to fs
> maintainers that they should significantly complicate writeback code / VFS
> with bounce page handling etc. for a thing that is a not-much-used corner
> case. So if we can get away with forbidding long-term pins, then that's the
> easiest solution. Dealing with short-term pins is easier as we can just
> wait for unpinning, which is implementable in a localized manner.

Totally understandable. It's, unfortunately I feel, a case of something we
should simply not have allowed.

> > > The FS also has to actively not write out the page while it cannot be
> > > write protected unless it copies the data to a stable page. The block
> > > stack needs the source data to be stable to do checksum/parity/etc
> > > stuff. It is a complicated subject.
> >
> > Yes, my sense was that being able to write arbitrarily to these pages _at
> > all_ was a big issue, not only the dirty tracking aspect.
>
> Yes.
>
> > I guess at some level letting filesystems have such total flexibility as
> > to how they implement things leaves us in a difficult position.
>
> I'm not sure what you mean by "total flexibility" here. In my opinion it is
> also about how HW performs checksumming etc.

I mean to say *_ops allow a lot of flexibility in how things are handled.
Certainly checksumming is a great example, but in theory an arbitrary
filesystem could be doing, well, anything, and always assuming that only
userland mappings should be modifying the underlying data.

> 								Honza
> --
> Jan Kara <jack@suse.com>
> SUSE Labs, CR
On Wed, May 17, 2023 at 09:29:20AM +0200, Jan Kara wrote:
> > > Surely it is, but like Ted said, the FS folks are not interested and
> > > they are at least half the solution..
> >
> > :'(
>
> Well, I'd phrase this a bit differently: it is a difficult sell to fs
> maintainers that they should significantly complicate writeback code / VFS
> with bounce page handling etc. for a thing that is a not-much-used corner
> case. So if we can get away with forbidding long-term pins, then that's the
> easiest solution. Dealing with short-term pins is easier as we can just
> wait for unpinning, which is implementable in a localized manner.

Full agreement here. The whole concept of supporting writeback for long term
mappings does not make much sense.

> > > The FS also has to actively not write out the page while it cannot be
> > > write protected unless it copies the data to a stable page. The block
> > > stack needs the source data to be stable to do checksum/parity/etc
> > > stuff. It is a complicated subject.
> >
> > Yes, my sense was that being able to write arbitrarily to these pages _at
> > all_ was a big issue, not only the dirty tracking aspect.
>
> Yes.
>
> > I guess at some level letting filesystems have such total flexibility as
> > to how they implement things leaves us in a difficult position.
>
> I'm not sure what you mean by "total flexibility" here. In my opinion it is
> also about how HW performs checksumming etc.

I have no idea what total flexibility is even supposed to be.
On Wed, May 17, 2023 at 08:40:26AM +0100, Lorenzo Stoakes wrote:
> > I'm not sure what you mean by "total flexibility" here. In my opinion it
> > is also about how HW performs checksumming etc.
>
> I mean to say *_ops allow a lot of flexibility in how things are handled.
> Certainly checksumming is a great example, but in theory an arbitrary
> filesystem could be doing, well, anything, and always assuming that only
> userland mappings should be modifying the underlying data.

File systems need a way to track when a page is dirtied so that it can be
written back. Not much to do with flexibility.
On Wed, May 17, 2023 at 12:43:34AM -0700, Christoph Hellwig wrote:
> On Wed, May 17, 2023 at 08:40:26AM +0100, Lorenzo Stoakes wrote:
> > > I'm not sure what you mean by "total flexibility" here. In my opinion
> > > it is also about how HW performs checksumming etc.
> >
> > I mean to say *_ops allow a lot of flexibility in how things are handled.
> > Certainly checksumming is a great example, but in theory an arbitrary
> > filesystem could be doing, well, anything, and always assuming that only
> > userland mappings should be modifying the underlying data.
>
> File systems need a way to track when a page is dirtied so that it can
> be written back. Not much to do with flexibility.

I'll try to take this in good faith because... yeah. I do get that, I mean I
literally created a repro for this situation and referenced in the commit
msg and comments this precise problem in my patch series that addresses...
this problem :P

Perhaps I'm not being clear, but it was simply my intent to highlight that
yes, this is the primary problem, but GUP writing to ostensibly 'clean'
pages 'behind the back' of a fs is _also_ a problem. Not least for
checksumming (e.g. assume hw-reported checksum for a block == checksum
derived from page cache) but, because VFS allows a great deal of flexibility
in how filesystems are implemented, perhaps in other respects we haven't
considered.

So I just wanted to highlight (happy to be corrected if I'm wrong) that the
PRIMARY problem is the dirty tracking breaking, but it also strikes me that
arbitrary writes to 'clean' pages in the background are a problem too.
On Wed, May 17, 2023 at 08:55:27AM +0100, Lorenzo Stoakes wrote:
> I'll try to take this in good faith because... yeah. I do get that, I mean
> I literally created a repro for this situation and referenced in the commit
> msg and comments this precise problem in my patch series that addresses...
> this problem :P
>
> Perhaps I'm not being clear, but it was simply my intent to highlight that
> yes, this is the primary problem, but GUP writing to ostensibly 'clean'
> pages 'behind the back' of a fs is _also_ a problem.

Yes, it absolutely is a problem if that happens. But we can just fix it in
the kernel using the:

	lock_page()
	copy data
	set_page_dirty_locked()
	unlock_page();

pattern, and we should have covered every place that did this in tree. But
there's no good way to verify it except for regular code audits.
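Spelled out as a minimal sketch for a single page, the pattern described
above might look like the following; set_page_dirty() is used here for the
dirty-marking step, since the exact helper name varies across kernel
versions, so treat this as an approximation rather than the canonical form.

#include <linux/highmem.h>
#include <linux/mm.h>
#include <linux/pagemap.h>

static void write_to_file_page(struct page *page, const void *src, size_t len)
{
	void *kaddr;

	lock_page(page);		/* serialise against writeback */
	kaddr = kmap_local_page(page);
	memcpy(kaddr, src, len);	/* copy data */
	kunmap_local(kaddr);
	set_page_dirty(page);		/* mark dirty while still locked */
	unlock_page(page);
}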
On 15.05.23 13:31, Lorenzo Stoakes wrote:
> On Sun, May 14, 2023 at 10:14:46PM -0700, Christoph Hellwig wrote:
>> On Sun, May 14, 2023 at 08:20:04PM +0100, Lorenzo Stoakes wrote:
>>> As discussed at LSF/MM, on the flight over I wrote a little repro [0]
>>> which reliably triggers the ext4 warning by recreating the scenario
>>> described above, using a small userland program and kernel module.
>>>
>>> This code is not perfect (plane code :) but does seem to do the job
>>> adequately, also obviously this should only be run in a VM environment
>>> where data loss is acceptable (in my case a small qemu instance).
>>
>> It would be really awesome if you could wire it up and submit it to
>> xfstests.
>
> Sure, am happy to take a look at that! Also happy if David finds it useful
> in any way for his unit tests.

I played with a simple selftest that would reuse the existing gup_test
infrastructure (adding PIN_LONGTERM_TEST_WRITE), and tried reproducing an
actual data corruption. So far, I was not able to reproduce any corruption
easily without your patches, because d824ec2a1546 ("mm: do not reclaim
private data from pinned page") seems to mitigate most of it.

So ... before my patches (adding PIN_LONGTERM_TEST_WRITE) I cannot test it
from a selftest, with d824ec2a1546 ("mm: do not reclaim private data from
pinned page") I cannot reproduce, and with your patches long-term pinning
just fails.

Long story short: I'll most probably not add such a test but instead keep
testing that long-term pinning works/fails now as expected, based on the FS
type.

> The kernel module interface is a bit sketchy (it takes a user address which
> it blindly pins for you) so it's not something that should be run in any
> unsafe environment, but as long as we are ok with that :)

I can submit the PIN_LONGTERM_TEST_WRITE extension, that would allow testing
with a stock kernel that has the module compiled in. It won't allow
!longterm, though (it would be kind-of hacky to have !longterm controlled by
user space, even if it's a GUP test module).

Finding an actual reproducer using existing pinning functionality would be
preferred. For example, using O_DIRECT (should be possible even before it
starts using FOLL_PIN instead of FOLL_GET). That would be highly racy then,
but most probably not impossible. Such (racy) tests are not a good fit for
selftests.

Maybe I'll have a try later to reproduce with O_DIRECT.
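As a very rough, hypothetical userspace sketch of the O_DIRECT idea mentioned
above: an O_DIRECT read that targets a MAP_SHARED file-backed buffer makes
the kernel take references on (and, after the FOLL_PIN conversion, pin) those
page-cache pages as the I/O destination. Whether this races with writeback in
an observable way is exactly the open question; the file names are arbitrary
and "source_file" is assumed to already contain at least 4096 bytes.

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	int buf_fd = open("mapped_file", O_RDWR | O_CREAT, 0600);
	int src_fd = open("source_file", O_RDONLY | O_DIRECT);
	void *buf;

	if (buf_fd < 0 || src_fd < 0 || ftruncate(buf_fd, 4096))
		return 1;

	/* Page-aligned, shared, file-backed buffer used as the I/O target. */
	buf = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, buf_fd, 0);
	if (buf == MAP_FAILED)
		return 1;

	/* The O_DIRECT read uses the file-backed pages as its destination. */
	if (read(src_fd, buf, 4096) != 4096)
		perror("read");

	return 0;
}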