mbox series

[RFC,0/7] block, fs: convert Direct IO to FOLL_PIN

Message ID 20220225085025.3052894-1-jhubbard@nvidia.com (mailing list archive)
Headers show
Series block, fs: convert Direct IO to FOLL_PIN | expand

Message

John Hubbard Feb. 25, 2022, 8:50 a.m. UTC
Hi,

Summary:

This puts some prerequisites in place, including a CONFIG parameter,
making it possible to start converting and testing the Direct IO part of
each filesystem, from get_user_pages_fast(), to pin_user_pages_fast().

It will take "a few" kernel releases to get the whole thing done.

Details:

As part of fixing the "get_user_pages() + file-backed memory" problem
[1], and to support various COW-related fixes as well [2], we need to
convert the Direct IO code from get_user_pages_fast(), to
pin_user_pages_fast(). Because pin_user_pages*() calls require a
corresponding call to unpin_user_page(), the conversion is more
elaborate than just substitution.

Further complicating the conversion, the block/bio layers get their
Direct IO pages via iov_iter_get_pages() and iov_iter_get_pages_alloc(),
each of which has a large number of callers. All of those callers need
to be audited and changed so that they call unpin_user_page(), rather
than put_page().

After quite some time exploring and consulting with people as well, it
is clear that this cannot be done in just one patchset. That's because,
not only is this large and time-consuming (for example, Chaitanya
Kulkarni's first reaction, after looking into the details, was, "convert
the remaining filesystems to use iomap, *then* convert to FOLL_PIN..."),
but it is also spread across many filesystems.

With that in mind, let's apply most of this patchset soon-ish, and then
work on the filesystem conversions, likely over the course of a few
kernel releases. Once complete, then apply the last patch, and then one
final name change to remove the dio_w_ prefixes, and get us back to the
original names.

In this patchset:

Patches 1, 2, 3: provide the prerequisites to start converting call
sites to call the new dio_w_*() wrapper functions.

Patch 4: convert the core allocation routines to
dio_w_pin_user_pages_fast().

Patches 5, 6: convert a couple of callers (NFS, fuse) to use FOLL_PIN.
This also is a placeholder to show that "filesystems need to be
converted at this point".

At this point, Ubuntu 20.04 boots up and is able to support running some
fio direct IO tests, while keeping the foll pin counts in /proc/vmstat
balanced. (Ubuntu uses fuse during startup, interestingly enough.)

Patch 7: Get rid of the CONFIG parameter, thus effectively switching the
default Direct IO mechanism over to pin_user_pages_fast().

(Not shown): Patch 8: trivial but large: rename everything to get rid of
the dio_w_ prefix, and delete the wrappers.

This is based on mmotm as of about an hour ago. I've also stashed it
here:

    https://github.com/johnhubbard/linux bio_pup_mmotm_20220224

[1] https://lwn.net/Articles/753027/ "The trouble with get_user_pages()"

[2] https://lore.kernel.org/all/20211217113049.23850-1-david@redhat.com/T/#u
    (David Hildenbrand's mm/COW fixes)

John Hubbard (7):
  mm/gup: introduce pin_user_page()
  block: add dio_w_*() wrappers for pin, unpin user pages
  block, fs: assert that key paths use iovecs, and nothing else
  block, bio, fs: initial pin_user_pages_fast() changes
  NFS: direct-io: convert to FOLL_PIN pages
  fuse: convert direct IO paths to use FOLL_PIN
  block, direct-io: flip the switch: use pin_user_pages_fast()

 block/bio.c          | 22 +++++++++++++---------
 block/blk-map.c      |  4 ++--
 fs/direct-io.c       | 26 ++++++++++++++------------
 fs/fuse/dev.c        |  5 ++++-
 fs/fuse/file.c       | 23 ++++++++---------------
 fs/iomap/direct-io.c |  2 +-
 fs/nfs/direct.c      |  2 +-
 include/linux/bvec.h |  4 ++++
 include/linux/mm.h   |  1 +
 lib/iov_iter.c       |  4 ++--
 mm/gup.c             | 34 ++++++++++++++++++++++++++++++++++
 11 files changed, 84 insertions(+), 43 deletions(-)


base-commit: 218d3ca9c0ea1c35f1bc5099325b7df54b52bbdd
prerequisite-patch-id: 7d5a742e37171a15d83a9b3ac9ba0951b573eed8
--
2.35.1

Comments

Jan Kara Feb. 25, 2022, 12:05 p.m. UTC | #1
On Fri 25-02-22 00:50:18, John Hubbard wrote:
> Hi,
> 
> Summary:
> 
> This puts some prerequisites in place, including a CONFIG parameter,
> making it possible to start converting and testing the Direct IO part of
> each filesystem, from get_user_pages_fast(), to pin_user_pages_fast().
> 
> It will take "a few" kernel releases to get the whole thing done.
> 
> Details:
> 
> As part of fixing the "get_user_pages() + file-backed memory" problem
> [1], and to support various COW-related fixes as well [2], we need to
> convert the Direct IO code from get_user_pages_fast(), to
> pin_user_pages_fast(). Because pin_user_pages*() calls require a
> corresponding call to unpin_user_page(), the conversion is more
> elaborate than just substitution.
> 
> Further complicating the conversion, the block/bio layers get their
> Direct IO pages via iov_iter_get_pages() and iov_iter_get_pages_alloc(),
> each of which has a large number of callers. All of those callers need
> to be audited and changed so that they call unpin_user_page(), rather
> than put_page().
> 
> After quite some time exploring and consulting with people as well, it
> is clear that this cannot be done in just one patchset. That's because,
> not only is this large and time-consuming (for example, Chaitanya
> Kulkarni's first reaction, after looking into the details, was, "convert
> the remaining filesystems to use iomap, *then* convert to FOLL_PIN..."),
> but it is also spread across many filesystems.

With having modified fs/direct-io.c and fs/iomap/direct-io.c which
filesystems do you know are missing conversion? Or is it that you just want
to make sure with audit everything is fine? The only fs I could find
unconverted by your changes is ceph. Am I missing something?

								Honza
David Hildenbrand Feb. 25, 2022, 1:12 p.m. UTC | #2
On 25.02.22 09:50, John Hubbard wrote:
> Hi,
> 
> Summary:
> 
> This puts some prerequisites in place, including a CONFIG parameter,
> making it possible to start converting and testing the Direct IO part of
> each filesystem, from get_user_pages_fast(), to pin_user_pages_fast().
> 
> It will take "a few" kernel releases to get the whole thing done.
> 
> Details:
> 
> As part of fixing the "get_user_pages() + file-backed memory" problem
> [1], and to support various COW-related fixes as well [2], we need to
> convert the Direct IO code from get_user_pages_fast(), to
> pin_user_pages_fast(). Because pin_user_pages*() calls require a
> corresponding call to unpin_user_page(), the conversion is more
> elaborate than just substitution.
> 
> Further complicating the conversion, the block/bio layers get their
> Direct IO pages via iov_iter_get_pages() and iov_iter_get_pages_alloc(),
> each of which has a large number of callers. All of those callers need
> to be audited and changed so that they call unpin_user_page(), rather
> than put_page().

vmsplice is another candidate that uses iov_iter_get_pages() and should
be converted to FOLL_PIN. For that particular user, we have to also pass
FOLL_LONGTERM -- vmsplice as it stands can block memory hotunplug / CMA
/ ... for all eternity.
Chaitanya Kulkarni Feb. 25, 2022, 4:14 p.m. UTC | #3
On 2/25/22 04:05, Jan Kara wrote:
> On Fri 25-02-22 00:50:18, John Hubbard wrote:
>> Hi,
>>
>> Summary:
>>
>> This puts some prerequisites in place, including a CONFIG parameter,
>> making it possible to start converting and testing the Direct IO part of
>> each filesystem, from get_user_pages_fast(), to pin_user_pages_fast().
>>
>> It will take "a few" kernel releases to get the whole thing done.
>>
>> Details:
>>
>> As part of fixing the "get_user_pages() + file-backed memory" problem
>> [1], and to support various COW-related fixes as well [2], we need to
>> convert the Direct IO code from get_user_pages_fast(), to
>> pin_user_pages_fast(). Because pin_user_pages*() calls require a
>> corresponding call to unpin_user_page(), the conversion is more
>> elaborate than just substitution.
>>
>> Further complicating the conversion, the block/bio layers get their
>> Direct IO pages via iov_iter_get_pages() and iov_iter_get_pages_alloc(),
>> each of which has a large number of callers. All of those callers need
>> to be audited and changed so that they call unpin_user_page(), rather
>> than put_page().
>>
>> After quite some time exploring and consulting with people as well, it
>> is clear that this cannot be done in just one patchset. That's because,
>> not only is this large and time-consuming (for example, Chaitanya
>> Kulkarni's first reaction, after looking into the details, was, "convert
>> the remaining filesystems to use iomap, *then* convert to FOLL_PIN..."),
>> but it is also spread across many filesystems.
> 
> With having modified fs/direct-io.c and fs/iomap/direct-io.c which
> filesystems do you know are missing conversion? Or is it that you just want
> to make sure with audit everything is fine? The only fs I could find
> unconverted by your changes is ceph. Am I missing something?

if I understand your comment correctly file systems which are listed in
the list see [1] (all the credit goes to John to have a complete list)
that are not using iomap but use XXX_XXX_direct_IO() should be fine,
since in the callchain going from :-

XXX_XXX_direct_io()
  __blkdev_direct_io()
   do_direct_io()

   ...

     submit_page_selection()
      get/put_page() <---

will take care of itself ?

> 
> 								Honza
> 

[1]

jfs_direct_IO()
nilfs_direct_IO()
ntfs_dirct_IO()
reiserfs_direct_IO()
udf_direct_IO()
ocfs2_dirct_IO()
affs_direct_IO()
exfat_direct_IO()
ext2_direct_IO()
fat_direct_IO()
hfs_direct_IO()
hfs_plus_direct_IO()

-ck
Jan Kara Feb. 25, 2022, 4:40 p.m. UTC | #4
On Fri 25-02-22 16:14:14, Chaitanya Kulkarni wrote:
> On 2/25/22 04:05, Jan Kara wrote:
> > On Fri 25-02-22 00:50:18, John Hubbard wrote:
> >> Hi,
> >>
> >> Summary:
> >>
> >> This puts some prerequisites in place, including a CONFIG parameter,
> >> making it possible to start converting and testing the Direct IO part of
> >> each filesystem, from get_user_pages_fast(), to pin_user_pages_fast().
> >>
> >> It will take "a few" kernel releases to get the whole thing done.
> >>
> >> Details:
> >>
> >> As part of fixing the "get_user_pages() + file-backed memory" problem
> >> [1], and to support various COW-related fixes as well [2], we need to
> >> convert the Direct IO code from get_user_pages_fast(), to
> >> pin_user_pages_fast(). Because pin_user_pages*() calls require a
> >> corresponding call to unpin_user_page(), the conversion is more
> >> elaborate than just substitution.
> >>
> >> Further complicating the conversion, the block/bio layers get their
> >> Direct IO pages via iov_iter_get_pages() and iov_iter_get_pages_alloc(),
> >> each of which has a large number of callers. All of those callers need
> >> to be audited and changed so that they call unpin_user_page(), rather
> >> than put_page().
> >>
> >> After quite some time exploring and consulting with people as well, it
> >> is clear that this cannot be done in just one patchset. That's because,
> >> not only is this large and time-consuming (for example, Chaitanya
> >> Kulkarni's first reaction, after looking into the details, was, "convert
> >> the remaining filesystems to use iomap, *then* convert to FOLL_PIN..."),
> >> but it is also spread across many filesystems.
> > 
> > With having modified fs/direct-io.c and fs/iomap/direct-io.c which
> > filesystems do you know are missing conversion? Or is it that you just want
> > to make sure with audit everything is fine? The only fs I could find
> > unconverted by your changes is ceph. Am I missing something?
> 
> if I understand your comment correctly file systems which are listed in
> the list see [1] (all the credit goes to John to have a complete list)
> that are not using iomap but use XXX_XXX_direct_IO() should be fine,
> since in the callchain going from :-
> 
> XXX_XXX_direct_io()
>   __blkdev_direct_io()
>    do_direct_io()
> 
>    ...
> 
>      submit_page_selection()
>       get/put_page() <---
> 
> will take care of itself ?

Yes, John's changes to fs/direct-io.c should take care of these
filesystems using __blkdev_direct_io().

								Honza

> [1]
> 
> jfs_direct_IO()
> nilfs_direct_IO()
> ntfs_dirct_IO()
> reiserfs_direct_IO()
> udf_direct_IO()
> ocfs2_dirct_IO()
> affs_direct_IO()
> exfat_direct_IO()
> ext2_direct_IO()
> fat_direct_IO()
> hfs_direct_IO()
> hfs_plus_direct_IO()
> 
> -ck
> 
>
John Hubbard Feb. 25, 2022, 7:36 p.m. UTC | #5
On 2/25/22 04:05, Jan Kara wrote:
...
>> After quite some time exploring and consulting with people as well, it
>> is clear that this cannot be done in just one patchset. That's because,
>> not only is this large and time-consuming (for example, Chaitanya
>> Kulkarni's first reaction, after looking into the details, was, "convert
>> the remaining filesystems to use iomap, *then* convert to FOLL_PIN..."),
>> but it is also spread across many filesystems.
> 
> With having modified fs/direct-io.c and fs/iomap/direct-io.c which
> filesystems do you know are missing conversion? Or is it that you just want
> to make sure with audit everything is fine? The only fs I could find
> unconverted by your changes is ceph. Am I missing something?
> 
> 								Honza

There are a few more filesystems that call iov_iter_get_pages() or
iov_iter_get_pages_alloc(), plus networking things as well, plus some
others that are hard to categorize, such as vhost. So we have:

* ceph
* rds
* cifs
* p9
* net: __zerocopy_sg_from_iter(), tls_setup_from_iter(),
* crypto: af_alg_make_sg() (maybe N/A)
* vmsplice() (as David Hildenbrand mentioned)
* vhost: vhost_scsi_map_to_sgl()

In addition to that, I was also worried that maybe the
blockdev_direct_IO() or iomap filesystems might be breaking encapsulation
occasionally, by calling put_page() on the direct IO user page buffer.
Perhaps in error paths.

Are you pretty sure that that last concern is not valid? That would be a
welcome bit of news.



thanks,
John Hubbard Feb. 25, 2022, 9:10 p.m. UTC | #6
On 2/25/22 05:12, David Hildenbrand wrote:
> vmsplice is another candidate that uses iov_iter_get_pages() and should

Yes.

> be converted to FOLL_PIN. For that particular user, we have to also pass
> FOLL_LONGTERM -- vmsplice as it stands can block memory hotunplug / CMA
> / ... for all eternity.
> 

Oh, thanks for pointing that out. I was thinking about Direct IO, and
would have overlooked that vmsplice need FOLL_LONGTERM.

I'm still quite vague about which parts of vmsplice are in pipe buffers
(and therefore will *not* get FOLL_PIN pages) and which are in user space
buffers. Just haven't spent enough time with vmsplice yet, hopefully it
will clear up with some study of the code.


thanks,
John Hubbard Feb. 25, 2022, 10:20 p.m. UTC | #7
On 2/25/22 11:36, John Hubbard wrote:
> On 2/25/22 04:05, Jan Kara wrote:
> ...
>> With having modified fs/direct-io.c and fs/iomap/direct-io.c which
>> filesystems do you know are missing conversion? Or is it that you just want
>> to make sure with audit everything is fine? The only fs I could find
>> unconverted by your changes is ceph. Am I missing something?
>>
>>                                 Honza
> 
> There are a few more filesystems that call iov_iter_get_pages() or
> iov_iter_get_pages_alloc(), plus networking things as well, plus some
> others that are hard to categorize, such as vhost. So we have:
> 
> * ceph
> * rds
> * cifs
> * p9
> * net: __zerocopy_sg_from_iter(), tls_setup_from_iter(),
> * crypto: af_alg_make_sg() (maybe N/A)
> * vmsplice() (as David Hildenbrand mentioned)
> * vhost: vhost_scsi_map_to_sgl()

...although...if each filesystem does *not* require custom attention,
then there is another, maybe better approach, which is: factor out an
iovec-only pair of allocators, and transition each subsystem when it's
ready. So:

     dio_iov_iter_get_pages()
     dio_iov_iter_get_pages_alloc()

That would allow doing this a bit at a time, and without the horrible
CONFIG parameter that is switched over all at once.

The bio_release_pages() still calls unpin_user_page(), though, so that
means that all Direct IO callers would still have to change over at
once.


thanks,