Message ID | 20220225085025.3052894-1-jhubbard@nvidia.com (mailing list archive) |
---|---|
Headers | show |
Series | block, fs: convert Direct IO to FOLL_PIN | expand |
On Fri 25-02-22 00:50:18, John Hubbard wrote: > Hi, > > Summary: > > This puts some prerequisites in place, including a CONFIG parameter, > making it possible to start converting and testing the Direct IO part of > each filesystem, from get_user_pages_fast(), to pin_user_pages_fast(). > > It will take "a few" kernel releases to get the whole thing done. > > Details: > > As part of fixing the "get_user_pages() + file-backed memory" problem > [1], and to support various COW-related fixes as well [2], we need to > convert the Direct IO code from get_user_pages_fast(), to > pin_user_pages_fast(). Because pin_user_pages*() calls require a > corresponding call to unpin_user_page(), the conversion is more > elaborate than just substitution. > > Further complicating the conversion, the block/bio layers get their > Direct IO pages via iov_iter_get_pages() and iov_iter_get_pages_alloc(), > each of which has a large number of callers. All of those callers need > to be audited and changed so that they call unpin_user_page(), rather > than put_page(). > > After quite some time exploring and consulting with people as well, it > is clear that this cannot be done in just one patchset. That's because, > not only is this large and time-consuming (for example, Chaitanya > Kulkarni's first reaction, after looking into the details, was, "convert > the remaining filesystems to use iomap, *then* convert to FOLL_PIN..."), > but it is also spread across many filesystems. With having modified fs/direct-io.c and fs/iomap/direct-io.c which filesystems do you know are missing conversion? Or is it that you just want to make sure with audit everything is fine? The only fs I could find unconverted by your changes is ceph. Am I missing something? Honza
On 25.02.22 09:50, John Hubbard wrote: > Hi, > > Summary: > > This puts some prerequisites in place, including a CONFIG parameter, > making it possible to start converting and testing the Direct IO part of > each filesystem, from get_user_pages_fast(), to pin_user_pages_fast(). > > It will take "a few" kernel releases to get the whole thing done. > > Details: > > As part of fixing the "get_user_pages() + file-backed memory" problem > [1], and to support various COW-related fixes as well [2], we need to > convert the Direct IO code from get_user_pages_fast(), to > pin_user_pages_fast(). Because pin_user_pages*() calls require a > corresponding call to unpin_user_page(), the conversion is more > elaborate than just substitution. > > Further complicating the conversion, the block/bio layers get their > Direct IO pages via iov_iter_get_pages() and iov_iter_get_pages_alloc(), > each of which has a large number of callers. All of those callers need > to be audited and changed so that they call unpin_user_page(), rather > than put_page(). vmsplice is another candidate that uses iov_iter_get_pages() and should be converted to FOLL_PIN. For that particular user, we have to also pass FOLL_LONGTERM -- vmsplice as it stands can block memory hotunplug / CMA / ... for all eternity.
On 2/25/22 04:05, Jan Kara wrote: > On Fri 25-02-22 00:50:18, John Hubbard wrote: >> Hi, >> >> Summary: >> >> This puts some prerequisites in place, including a CONFIG parameter, >> making it possible to start converting and testing the Direct IO part of >> each filesystem, from get_user_pages_fast(), to pin_user_pages_fast(). >> >> It will take "a few" kernel releases to get the whole thing done. >> >> Details: >> >> As part of fixing the "get_user_pages() + file-backed memory" problem >> [1], and to support various COW-related fixes as well [2], we need to >> convert the Direct IO code from get_user_pages_fast(), to >> pin_user_pages_fast(). Because pin_user_pages*() calls require a >> corresponding call to unpin_user_page(), the conversion is more >> elaborate than just substitution. >> >> Further complicating the conversion, the block/bio layers get their >> Direct IO pages via iov_iter_get_pages() and iov_iter_get_pages_alloc(), >> each of which has a large number of callers. All of those callers need >> to be audited and changed so that they call unpin_user_page(), rather >> than put_page(). >> >> After quite some time exploring and consulting with people as well, it >> is clear that this cannot be done in just one patchset. That's because, >> not only is this large and time-consuming (for example, Chaitanya >> Kulkarni's first reaction, after looking into the details, was, "convert >> the remaining filesystems to use iomap, *then* convert to FOLL_PIN..."), >> but it is also spread across many filesystems. > > With having modified fs/direct-io.c and fs/iomap/direct-io.c which > filesystems do you know are missing conversion? Or is it that you just want > to make sure with audit everything is fine? The only fs I could find > unconverted by your changes is ceph. Am I missing something? if I understand your comment correctly file systems which are listed in the list see [1] (all the credit goes to John to have a complete list) that are not using iomap but use XXX_XXX_direct_IO() should be fine, since in the callchain going from :- XXX_XXX_direct_io() __blkdev_direct_io() do_direct_io() ... submit_page_selection() get/put_page() <--- will take care of itself ? > > Honza > [1] jfs_direct_IO() nilfs_direct_IO() ntfs_dirct_IO() reiserfs_direct_IO() udf_direct_IO() ocfs2_dirct_IO() affs_direct_IO() exfat_direct_IO() ext2_direct_IO() fat_direct_IO() hfs_direct_IO() hfs_plus_direct_IO() -ck
On Fri 25-02-22 16:14:14, Chaitanya Kulkarni wrote: > On 2/25/22 04:05, Jan Kara wrote: > > On Fri 25-02-22 00:50:18, John Hubbard wrote: > >> Hi, > >> > >> Summary: > >> > >> This puts some prerequisites in place, including a CONFIG parameter, > >> making it possible to start converting and testing the Direct IO part of > >> each filesystem, from get_user_pages_fast(), to pin_user_pages_fast(). > >> > >> It will take "a few" kernel releases to get the whole thing done. > >> > >> Details: > >> > >> As part of fixing the "get_user_pages() + file-backed memory" problem > >> [1], and to support various COW-related fixes as well [2], we need to > >> convert the Direct IO code from get_user_pages_fast(), to > >> pin_user_pages_fast(). Because pin_user_pages*() calls require a > >> corresponding call to unpin_user_page(), the conversion is more > >> elaborate than just substitution. > >> > >> Further complicating the conversion, the block/bio layers get their > >> Direct IO pages via iov_iter_get_pages() and iov_iter_get_pages_alloc(), > >> each of which has a large number of callers. All of those callers need > >> to be audited and changed so that they call unpin_user_page(), rather > >> than put_page(). > >> > >> After quite some time exploring and consulting with people as well, it > >> is clear that this cannot be done in just one patchset. That's because, > >> not only is this large and time-consuming (for example, Chaitanya > >> Kulkarni's first reaction, after looking into the details, was, "convert > >> the remaining filesystems to use iomap, *then* convert to FOLL_PIN..."), > >> but it is also spread across many filesystems. > > > > With having modified fs/direct-io.c and fs/iomap/direct-io.c which > > filesystems do you know are missing conversion? Or is it that you just want > > to make sure with audit everything is fine? The only fs I could find > > unconverted by your changes is ceph. Am I missing something? > > if I understand your comment correctly file systems which are listed in > the list see [1] (all the credit goes to John to have a complete list) > that are not using iomap but use XXX_XXX_direct_IO() should be fine, > since in the callchain going from :- > > XXX_XXX_direct_io() > __blkdev_direct_io() > do_direct_io() > > ... > > submit_page_selection() > get/put_page() <--- > > will take care of itself ? Yes, John's changes to fs/direct-io.c should take care of these filesystems using __blkdev_direct_io(). Honza > [1] > > jfs_direct_IO() > nilfs_direct_IO() > ntfs_dirct_IO() > reiserfs_direct_IO() > udf_direct_IO() > ocfs2_dirct_IO() > affs_direct_IO() > exfat_direct_IO() > ext2_direct_IO() > fat_direct_IO() > hfs_direct_IO() > hfs_plus_direct_IO() > > -ck > >
On 2/25/22 04:05, Jan Kara wrote: ... >> After quite some time exploring and consulting with people as well, it >> is clear that this cannot be done in just one patchset. That's because, >> not only is this large and time-consuming (for example, Chaitanya >> Kulkarni's first reaction, after looking into the details, was, "convert >> the remaining filesystems to use iomap, *then* convert to FOLL_PIN..."), >> but it is also spread across many filesystems. > > With having modified fs/direct-io.c and fs/iomap/direct-io.c which > filesystems do you know are missing conversion? Or is it that you just want > to make sure with audit everything is fine? The only fs I could find > unconverted by your changes is ceph. Am I missing something? > > Honza There are a few more filesystems that call iov_iter_get_pages() or iov_iter_get_pages_alloc(), plus networking things as well, plus some others that are hard to categorize, such as vhost. So we have: * ceph * rds * cifs * p9 * net: __zerocopy_sg_from_iter(), tls_setup_from_iter(), * crypto: af_alg_make_sg() (maybe N/A) * vmsplice() (as David Hildenbrand mentioned) * vhost: vhost_scsi_map_to_sgl() In addition to that, I was also worried that maybe the blockdev_direct_IO() or iomap filesystems might be breaking encapsulation occasionally, by calling put_page() on the direct IO user page buffer. Perhaps in error paths. Are you pretty sure that that last concern is not valid? That would be a welcome bit of news. thanks,
On 2/25/22 05:12, David Hildenbrand wrote: > vmsplice is another candidate that uses iov_iter_get_pages() and should Yes. > be converted to FOLL_PIN. For that particular user, we have to also pass > FOLL_LONGTERM -- vmsplice as it stands can block memory hotunplug / CMA > / ... for all eternity. > Oh, thanks for pointing that out. I was thinking about Direct IO, and would have overlooked that vmsplice need FOLL_LONGTERM. I'm still quite vague about which parts of vmsplice are in pipe buffers (and therefore will *not* get FOLL_PIN pages) and which are in user space buffers. Just haven't spent enough time with vmsplice yet, hopefully it will clear up with some study of the code. thanks,
On 2/25/22 11:36, John Hubbard wrote: > On 2/25/22 04:05, Jan Kara wrote: > ... >> With having modified fs/direct-io.c and fs/iomap/direct-io.c which >> filesystems do you know are missing conversion? Or is it that you just want >> to make sure with audit everything is fine? The only fs I could find >> unconverted by your changes is ceph. Am I missing something? >> >> Honza > > There are a few more filesystems that call iov_iter_get_pages() or > iov_iter_get_pages_alloc(), plus networking things as well, plus some > others that are hard to categorize, such as vhost. So we have: > > * ceph > * rds > * cifs > * p9 > * net: __zerocopy_sg_from_iter(), tls_setup_from_iter(), > * crypto: af_alg_make_sg() (maybe N/A) > * vmsplice() (as David Hildenbrand mentioned) > * vhost: vhost_scsi_map_to_sgl() ...although...if each filesystem does *not* require custom attention, then there is another, maybe better approach, which is: factor out an iovec-only pair of allocators, and transition each subsystem when it's ready. So: dio_iov_iter_get_pages() dio_iov_iter_get_pages_alloc() That would allow doing this a bit at a time, and without the horrible CONFIG parameter that is switched over all at once. The bio_release_pages() still calls unpin_user_page(), though, so that means that all Direct IO callers would still have to change over at once. thanks,