diff mbox series

[v7,2/8] iov_iter: Add a function to extract a page list from an iterator

Message ID 20230120175556.3556978-3-dhowells@redhat.com (mailing list archive)
State New
Headers show
Series None | expand

Commit Message

David Howells Jan. 20, 2023, 5:55 p.m. UTC
Add a function, iov_iter_extract_pages(), to extract a list of pages from
an iterator.  The pages may be returned with a reference added or a pin
added or neither, depending on the type of iterator and the direction of
transfer.  The caller must pass FOLL_READ_FROM_MEM or FOLL_WRITE_TO_MEM
as part of gup_flags to indicate how the iterator contents are to be used.

Add a second function, iov_iter_extract_mode(), to determine how the
cleanup should be done.

There are three cases:

 (1) Transfer *into* an ITER_IOVEC or ITER_UBUF iterator.

     Extracted pages will have pins obtained on them (but not references)
     so that fork() doesn't CoW the pages incorrectly whilst the I/O is in
     progress.

     iov_iter_extract_mode() will return FOLL_PIN for this case.  The
     caller should use something like unpin_user_page() to dispose of the
     page.

 (2) Transfer is *out of* an ITER_IOVEC or ITER_UBUF iterator.

     Extracted pages will have references obtained on them, but not pins.

     iov_iter_extract_mode() will return FOLL_GET.  The caller should use
     something like put_page() for page disposal.

 (3) Any other sort of iterator.

     No refs or pins are obtained on the page, the assumption is made that
     the caller will manage page retention.  ITER_ALLOW_P2PDMA is not
     permitted.

     iov_iter_extract_mode() will return 0.  The pages don't need
     additional disposal.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: Christoph Hellwig <hch@lst.de>
cc: John Hubbard <jhubbard@nvidia.com>
cc: Matthew Wilcox <willy@infradead.org>
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
Link: https://lore.kernel.org/r/166920903885.1461876.692029808682876184.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/166997421646.9475.14837976344157464997.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/167305163883.1521586.10777155475378874823.stgit@warthog.procyon.org.uk/ # v4
Link: https://lore.kernel.org/r/167344728530.2425628.9613910866466387722.stgit@warthog.procyon.org.uk/ # v5
Link: https://lore.kernel.org/r/167391053207.2311931.16398133457201442907.stgit@warthog.procyon.org.uk/ # v6
---

Notes:
    ver #7)
     - Switch to passing in iter-specific flags rather than FOLL_* flags.
     - Drop the direction flags for now.
     - Use ITER_ALLOW_P2PDMA to request FOLL_PCI_P2PDMA.
     - Disallow use of ITER_ALLOW_P2PDMA with non-user-backed iter.
     - Add support for extraction from KVEC-type iters.
     - Use iov_iter_advance() rather than open-coding it.
     - Make BVEC- and KVEC-type skip over initial empty vectors.
    
    ver #6)
     - Add back the function to indicate the cleanup mode.
     - Drop the cleanup_mode return arg to iov_iter_extract_pages().
     - Pass FOLL_SOURCE/DEST_BUF in gup_flags.  Check this against the iter
       data_source.
    
    ver #4)
     - Use ITER_SOURCE/DEST instead of WRITE/READ.
     - Allow additional FOLL_* flags, such as FOLL_PCI_P2PDMA to be passed in.
    
    ver #3)
     - Switch to using EXPORT_SYMBOL_GPL to prevent indirect 3rd-party access
       to get/pin_user_pages_fast()[1].

 include/linux/uio.h |  28 +++
 lib/iov_iter.c      | 424 ++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 452 insertions(+)

Comments

Christoph Hellwig Jan. 21, 2023, 1:01 p.m. UTC | #1
> +#define iov_iter_extract_mode(iter, extract_flags) \
> +	(user_backed_iter(iter) ?				\
> +	 (iter->data_source == ITER_SOURCE) ?			\
> +	 FOLL_GET : FOLL_PIN : 0)

The extract_flags argument is unused here.
Christoph Hellwig Jan. 21, 2023, 1:10 p.m. UTC | #2
On Fri, Jan 20, 2023 at 05:55:50PM +0000, David Howells wrote:
>  (3) Any other sort of iterator.
> 
>      No refs or pins are obtained on the page, the assumption is made that
>      the caller will manage page retention.  ITER_ALLOW_P2PDMA is not
>      permitted.

The ITER_ALLOW_P2PDMA check here is fundamentally wrong and will
break things like io_uring to fixed buffers on NVMe.  ITER_ALLOW_P2PDMA
should strictly increase the group of acceptable iters, it does must
not restrict anything.
David Howells Jan. 21, 2023, 1:30 p.m. UTC | #3
Christoph Hellwig <hch@infradead.org> wrote:

> On Fri, Jan 20, 2023 at 05:55:50PM +0000, David Howells wrote:
> >  (3) Any other sort of iterator.
> > 
> >      No refs or pins are obtained on the page, the assumption is made that
> >      the caller will manage page retention.  ITER_ALLOW_P2PDMA is not
> >      permitted.
> 
> The ITER_ALLOW_P2PDMA check here is fundamentally wrong and will
> break things like io_uring to fixed buffers on NVMe.  ITER_ALLOW_P2PDMA
> should strictly increase the group of acceptable iters, it does must
> not restrict anything.

So just drop the check?  Or do I actually need to do something to the pages to
make it work?  If I understand the code correctly, it's not actually operable
right now on any page that doesn't have an appropriately marked PTE.

David
Christoph Hellwig Jan. 21, 2023, 1:33 p.m. UTC | #4
On Sat, Jan 21, 2023 at 01:30:47PM +0000, David Howells wrote:
> So just drop the check?  Or do I actually need to do something to the pages to
> make it work?  If I understand the code correctly, it's not actually operable
> right now on any page that doesn't have an appropriately marked PTE.

Yes, just drop the check. The flag just signals that the caller can
deal with p2p pages.  If it gets any or not is a totally different
story.
David Hildenbrand Jan. 23, 2023, 11:28 a.m. UTC | #5
On 20.01.23 18:55, David Howells wrote:
> Add a function, iov_iter_extract_pages(), to extract a list of pages from
> an iterator.  The pages may be returned with a reference added or a pin
> added or neither, depending on the type of iterator and the direction of
> transfer.  The caller must pass FOLL_READ_FROM_MEM or FOLL_WRITE_TO_MEM
> as part of gup_flags to indicate how the iterator contents are to be used.
> 
> Add a second function, iov_iter_extract_mode(), to determine how the
> cleanup should be done.
> 
> There are three cases:
> 
>   (1) Transfer *into* an ITER_IOVEC or ITER_UBUF iterator.
> 
>       Extracted pages will have pins obtained on them (but not references)
>       so that fork() doesn't CoW the pages incorrectly whilst the I/O is in
>       progress.
> 
>       iov_iter_extract_mode() will return FOLL_PIN for this case.  The
>       caller should use something like unpin_user_page() to dispose of the
>       page.
> 
>   (2) Transfer is *out of* an ITER_IOVEC or ITER_UBUF iterator.
> 
>       Extracted pages will have references obtained on them, but not pins.
> 
>       iov_iter_extract_mode() will return FOLL_GET.  The caller should use
>       something like put_page() for page disposal.
> 
>   (3) Any other sort of iterator.
> 
>       No refs or pins are obtained on the page, the assumption is made that
>       the caller will manage page retention.  ITER_ALLOW_P2PDMA is not
>       permitted.
> 
>       iov_iter_extract_mode() will return 0.  The pages don't need
>       additional disposal.
> 
> Signed-off-by: David Howells <dhowells@redhat.com>
> cc: Al Viro <viro@zeniv.linux.org.uk>
> cc: Christoph Hellwig <hch@lst.de>
> cc: John Hubbard <jhubbard@nvidia.com>
> cc: Matthew Wilcox <willy@infradead.org>
> cc: linux-fsdevel@vger.kernel.org
> cc: linux-mm@kvack.org
> Link: https://lore.kernel.org/r/166920903885.1461876.692029808682876184.stgit@warthog.procyon.org.uk/ # v2
> Link: https://lore.kernel.org/r/166997421646.9475.14837976344157464997.stgit@warthog.procyon.org.uk/ # v3
> Link: https://lore.kernel.org/r/167305163883.1521586.10777155475378874823.stgit@warthog.procyon.org.uk/ # v4
> Link: https://lore.kernel.org/r/167344728530.2425628.9613910866466387722.stgit@warthog.procyon.org.uk/ # v5
> Link: https://lore.kernel.org/r/167391053207.2311931.16398133457201442907.stgit@warthog.procyon.org.uk/ # v6
> ---
> 
> Notes:
>      ver #7)
>       - Switch to passing in iter-specific flags rather than FOLL_* flags.
>       - Drop the direction flags for now.
>       - Use ITER_ALLOW_P2PDMA to request FOLL_PCI_P2PDMA.
>       - Disallow use of ITER_ALLOW_P2PDMA with non-user-backed iter.
>       - Add support for extraction from KVEC-type iters.
>       - Use iov_iter_advance() rather than open-coding it.
>       - Make BVEC- and KVEC-type skip over initial empty vectors.
>      
>      ver #6)
>       - Add back the function to indicate the cleanup mode.
>       - Drop the cleanup_mode return arg to iov_iter_extract_pages().
>       - Pass FOLL_SOURCE/DEST_BUF in gup_flags.  Check this against the iter
>         data_source.
>      
>      ver #4)
>       - Use ITER_SOURCE/DEST instead of WRITE/READ.
>       - Allow additional FOLL_* flags, such as FOLL_PCI_P2PDMA to be passed in.
>      
>      ver #3)
>       - Switch to using EXPORT_SYMBOL_GPL to prevent indirect 3rd-party access
>         to get/pin_user_pages_fast()[1].
> 
>   include/linux/uio.h |  28 +++
>   lib/iov_iter.c      | 424 ++++++++++++++++++++++++++++++++++++++++++++
>   2 files changed, 452 insertions(+)
> 
> diff --git a/include/linux/uio.h b/include/linux/uio.h
> index 46d5080314c6..a4233049ab7a 100644
> --- a/include/linux/uio.h
> +++ b/include/linux/uio.h
> @@ -363,4 +363,32 @@ static inline void iov_iter_ubuf(struct iov_iter *i, unsigned int direction,
>   /* Flags for iov_iter_get/extract_pages*() */
>   #define ITER_ALLOW_P2PDMA	0x01	/* Allow P2PDMA on the extracted pages */
>   
> +ssize_t iov_iter_extract_pages(struct iov_iter *i, struct page ***pages,
> +			       size_t maxsize, unsigned int maxpages,
> +			       unsigned int extract_flags, size_t *offset0);
> +
> +/**
> + * iov_iter_extract_mode - Indicate how pages from the iterator will be retained
> + * @iter: The iterator
> + * @extract_flags: How the iterator is to be used
> + *
> + * Examine the iterator and @extract_flags and indicate by returning FOLL_PIN,
> + * FOLL_GET or 0 as to how, if at all, pages extracted from the iterator will
> + * be retained by the extraction function.
> + *
> + * FOLL_GET indicates that the pages will have a reference taken on them that
> + * the caller must put.  This can be done for DMA/async DIO write from a page.
> + *
> + * FOLL_PIN indicates that the pages will have a pin placed in them that the
> + * caller must unpin.  This is must be done for DMA/async DIO read to a page to
> + * avoid CoW problems in fork.
> + *
> + * 0 indicates that no measures are taken and that it's up to the caller to
> + * retain the pages.
> + */
> +#define iov_iter_extract_mode(iter, extract_flags) \
> +	(user_backed_iter(iter) ?				\
> +	 (iter->data_source == ITER_SOURCE) ?			\
> +	 FOLL_GET : FOLL_PIN : 0)
> +
>

How does this work align with the goal of no longer using FOLL_GET for 
O_DIRECT? We should get rid of any FOLL_GET usage for accessing page 
content.

@John, any comments?
David Howells Jan. 23, 2023, 11:51 a.m. UTC | #6
David Hildenbrand <david@redhat.com> wrote:

> How does this work align with the goal of no longer using FOLL_GET for
> O_DIRECT? We should get rid of any FOLL_GET usage for accessing page content.

Would that run the risk of changes being made by the child being visible to
the a DIO write if the parent changes the buffer first?


	PARENT			CHILD
	======			=====
	start-DIO-write
	fork() = pid		fork() = 0
	alter-buffer
	CoW happens
	page copied		original page retained
				alter-buffer
		<DMA-happens>

David
David Howells Jan. 23, 2023, noon UTC | #7
David Howells <dhowells@redhat.com> wrote:

> > How does this work align with the goal of no longer using FOLL_GET for
> > O_DIRECT? We should get rid of any FOLL_GET usage for accessing page content.
> 
> Would that run the risk of changes being made by the child being visible to
> the a DIO write if the parent changes the buffer first?
> 
> 
> 	PARENT			CHILD
> 	======			=====
> 	start-DIO-write
> 	fork() = pid		fork() = 0
> 	alter-buffer
> 	CoW happens
> 	page copied		original page retained
> 				alter-buffer
> 		<DMA-happens>

Ah, I think I might have got the wrong end of the stick.  A pinned page is
*always* copied on fork() if I understand copy_present_pte() correctly.

David
David Hildenbrand Jan. 23, 2023, 1:11 p.m. UTC | #8
On 23.01.23 12:51, David Howells wrote:
> David Hildenbrand <david@redhat.com> wrote:
> 
>> How does this work align with the goal of no longer using FOLL_GET for
>> O_DIRECT? We should get rid of any FOLL_GET usage for accessing page content.
> 
> Would that run the risk of changes being made by the child being visible to
> the a DIO write if the parent changes the buffer first?
> 
> 
> 	PARENT			CHILD
> 	======			=====
> 	start-DIO-write
> 	fork() = pid		fork() = 0
> 	alter-buffer
> 	CoW happens
> 	page copied		original page retained
> 				alter-buffer
> 		<DMA-happens>

FOLL_PIN users are fine in that regard, because we properly detect 
"maybe pinned" during fork() and copy the page. See 
tools/testing/selftests/mm/cow.c (still called 
tools/testing/selftests/vm/cow.c upstream IIRC) for some test cases for 
that handling.

FOLL_GET does not work as expected in that regard: pages can't be 
detected as pinned and we won't be copying them during fork(). We'll end 
up COW-sharing them, which can result in trouble later.

Switching from FOLL_GET to FOLL_PIN was in the works by John H. Not sure 
what the status is. Interestingly, 
Documentation/core-api/pin_user_pages.rst already documents that "CASE 
1: Direct IO (DIO)" uses FOLL_PIN ... which does, unfortunately, no 
reflect reality yet.
David Howells Jan. 23, 2023, 1:19 p.m. UTC | #9
David Hildenbrand <david@redhat.com> wrote:

> Switching from FOLL_GET to FOLL_PIN was in the works by John H. Not sure what
> the status is. Interestingly, Documentation/core-api/pin_user_pages.rst
> already documents that "CASE 1: Direct IO (DIO)" uses FOLL_PIN ... which does,
> unfortunately, no reflect reality yet.

Yeah - I just came across that.

Should iov_iter.c then switch entirely to using pin_user_pages(), rather than
get_user_pages()?  In which case my patches only need keep track of
pinned/not-pinned and never "got".

David
David Hildenbrand Jan. 23, 2023, 1:24 p.m. UTC | #10
On 23.01.23 14:19, David Howells wrote:
> David Hildenbrand <david@redhat.com> wrote:
> 
>> Switching from FOLL_GET to FOLL_PIN was in the works by John H. Not sure what
>> the status is. Interestingly, Documentation/core-api/pin_user_pages.rst
>> already documents that "CASE 1: Direct IO (DIO)" uses FOLL_PIN ... which does,
>> unfortunately, no reflect reality yet.
> 
> Yeah - I just came across that.
> 
> Should iov_iter.c then switch entirely to using pin_user_pages(), rather than
> get_user_pages()?  In which case my patches only need keep track of
> pinned/not-pinned and never "got".

That would be the ideal case: whenever intending to access page content, 
use FOLL_PIN instead of FOLL_GET.

The issue that John was trying to sort out was that there are plenty of 
callsites that do a simple put_page() instead of calling 
unpin_user_page(). IIRC, handling that correctly in existing code -- 
what was pinned must be released via unpin_user_page() -- was the 
biggest workitem.

Not sure how that relates to your work here (that's why I was asking): 
if you could avoid FOLL_GET, that would be great :)
David Howells Jan. 23, 2023, 1:38 p.m. UTC | #11
David Hildenbrand <david@redhat.com> wrote:

> That would be the ideal case: whenever intending to access page content, use
> FOLL_PIN instead of FOLL_GET.
> 
> The issue that John was trying to sort out was that there are plenty of
> callsites that do a simple put_page() instead of calling
> unpin_user_page(). IIRC, handling that correctly in existing code -- what was
> pinned must be released via unpin_user_page() -- was the biggest workitem.
> 
> Not sure how that relates to your work here (that's why I was asking): if you
> could avoid FOLL_GET, that would be great :)

Well, it simplifies things a bit.

I can make the new iov_iter_extract_pages() just do "pin" or "don't pin" and
do no ref-getting at all.  Things can be converted over to "unpin the pages or
doing nothing" as they're converted over to using iov_iter_extract_pages()
from iov_iter_get_pages*().

The block bio code then only needs a single bit of state: pinned or not
pinned.

For cifs RDMA, do I need to make it pass in FOLL_LONGTERM?  And does that need
a special cleanup?

sk_buff fragment handling could still be tricky.  I'm thinking that in that
code I'll need to store FOLL_GET/PIN in the bottom two bits of the frag page
pointer.  Sometimes it allocates a new page and attaches it (have ref);
sometimes it does zerocopy to/from a page (have pin) and sometimes it may be
pointing to a kernel buffer (don't pin or ref).

David
David Hildenbrand Jan. 23, 2023, 2:20 p.m. UTC | #12
On 23.01.23 14:38, David Howells wrote:
> David Hildenbrand <david@redhat.com> wrote:
> 
>> That would be the ideal case: whenever intending to access page content, use
>> FOLL_PIN instead of FOLL_GET.
>>
>> The issue that John was trying to sort out was that there are plenty of
>> callsites that do a simple put_page() instead of calling
>> unpin_user_page(). IIRC, handling that correctly in existing code -- what was
>> pinned must be released via unpin_user_page() -- was the biggest workitem.
>>
>> Not sure how that relates to your work here (that's why I was asking): if you
>> could avoid FOLL_GET, that would be great :)
> 
> Well, it simplifies things a bit.
> 
> I can make the new iov_iter_extract_pages() just do "pin" or "don't pin" and
> do no ref-getting at all.  Things can be converted over to "unpin the pages or
> doing nothing" as they're converted over to using iov_iter_extract_pages()
> from iov_iter_get_pages*().
> 
> The block bio code then only needs a single bit of state: pinned or not
> pinned.

Unfortunately, I'll have to let BIO experts comment on that :) I only 
know the MM side of things here.

> 
> For cifs RDMA, do I need to make it pass in FOLL_LONGTERM?  And does that need
> a special cleanup?

Anything that holds pins "possibly forever" should that. vmsplice() is 
another example that should use it, once properly using FOLL_PIN. 
[FOLL_GET | FOLL_LONGTERM is not really used/defined with semantics]

> 
> sk_buff fragment handling could still be tricky.  I'm thinking that in that
> code I'll need to store FOLL_GET/PIN in the bottom two bits of the frag page
> pointer.  Sometimes it allocates a new page and attaches it (have ref);
> sometimes it does zerocopy to/from a page (have pin) and sometimes it may be
> pointing to a kernel buffer (don't pin or ref).
> 
> David
>
Christoph Hellwig Jan. 23, 2023, 2:48 p.m. UTC | #13
On Mon, Jan 23, 2023 at 03:20:41PM +0100, David Hildenbrand wrote:
> > The block bio code then only needs a single bit of state: pinned or not
> > pinned.
> 
> Unfortunately, I'll have to let BIO experts comment on that :) I only know
> the MM side of things here.

It can work for bios.
Jan Kara Jan. 23, 2023, 4:11 p.m. UTC | #14
On Mon 23-01-23 13:38:31, David Howells wrote:
> David Hildenbrand <david@redhat.com> wrote:
> 
> > That would be the ideal case: whenever intending to access page content, use
> > FOLL_PIN instead of FOLL_GET.
> > 
> > The issue that John was trying to sort out was that there are plenty of
> > callsites that do a simple put_page() instead of calling
> > unpin_user_page(). IIRC, handling that correctly in existing code -- what was
> > pinned must be released via unpin_user_page() -- was the biggest workitem.
> > 
> > Not sure how that relates to your work here (that's why I was asking): if you
> > could avoid FOLL_GET, that would be great :)
> 
> Well, it simplifies things a bit.
> 
> I can make the new iov_iter_extract_pages() just do "pin" or "don't pin" and
> do no ref-getting at all.  Things can be converted over to "unpin the pages or
> doing nothing" as they're converted over to using iov_iter_extract_pages()
> from iov_iter_get_pages*().
> 
> The block bio code then only needs a single bit of state: pinned or not
> pinned.

I'm all for using only pin/unpin in the end. But you'd still need to handle
the 'put' for the intermediate time when there are still FOLL_GET users of
the bio infrastructure, wouldn't you?

> For cifs RDMA, do I need to make it pass in FOLL_LONGTERM?  And does that need
> a special cleanup?

FOLL_LONGTERM doesn't need a special cleanup AFAIK. It should be used
whenever there isn't reasonably bound time after which the page is
unpinned. So in case CIFS sets up RDMA and then it is up to userspace how
long the RDMA is going to be running it should be using FOLL_LONGTERM. The
thing is that pins can block e.g. truncate for DAX inodes and so longterm
pins are not supported for DAX backed pages.

								Honza
Christoph Hellwig Jan. 23, 2023, 4:17 p.m. UTC | #15
On Mon, Jan 23, 2023 at 05:11:14PM +0100, Jan Kara wrote:
> I'm all for using only pin/unpin in the end. But you'd still need to handle
> the 'put' for the intermediate time when there are still FOLL_GET users of
> the bio infrastructure, wouldn't you?

Yes, we need it for the transition.  But Dave already has a patch to
convert the legacy direct-io code as well, at which point we can
the retire the get bit.
John Hubbard Jan. 23, 2023, 7:56 p.m. UTC | #16
On 1/23/23 05:24, David Hildenbrand wrote:
> On 23.01.23 14:19, David Howells wrote:
>> David Hildenbrand <david@redhat.com> wrote:
>>
>>> Switching from FOLL_GET to FOLL_PIN was in the works by John H. Not sure what
>>> the status is. Interestingly, Documentation/core-api/pin_user_pages.rst
>>> already documents that "CASE 1: Direct IO (DIO)" uses FOLL_PIN ... which does,
>>> unfortunately, no reflect reality yet.
>>

Yes, that part of the documentation is...aspirational. :) But this
series is taking us there, so good. Let me go review v8 of the series in
actual detail to verify but it sounds very promising.

>> Yeah - I just came across that.
>>
>> Should iov_iter.c then switch entirely to using pin_user_pages(), rather than
>> get_user_pages()?  In which case my patches only need keep track of
>> pinned/not-pinned and never "got".
> 
> That would be the ideal case: whenever intending to access page content, use FOLL_PIN instead of FOLL_GET.
> 
> The issue that John was trying to sort out was that there are plenty of callsites that do a simple put_page() instead of calling unpin_user_page(). IIRC, handling that correctly in existing code -- what was pinned must be released via unpin_user_page() -- was the biggest workitem.
> 
> Not sure how that relates to your work here (that's why I was asking): if you could avoid FOLL_GET, that would be great :)
> 

The largest part of this problem has been: __iov_iter_get_pages_alloc()
calls get_user_pages_fast() (so, FOLL_GET), as part of the Direct IO
path. And that __iov_iter_get_pages_alloc() is also called by a wide
variety of things that are not Direct IO: networking, crytpo, RDS.

So splitting out a variant that only Direct IO uses is a great move and
should allow conversion of Direct IO to FOLL_PIN. Again, let me go do an
actual review to check on that.

thanks,
John Hubbard Jan. 23, 2023, 11:07 p.m. UTC | #17
On 1/23/23 08:11, Jan Kara wrote:
>> For cifs RDMA, do I need to make it pass in FOLL_LONGTERM?  And does that need
>> a special cleanup?
> 
> FOLL_LONGTERM doesn't need a special cleanup AFAIK. It should be used
> whenever there isn't reasonably bound time after which the page is
> unpinned. So in case CIFS sets up RDMA and then it is up to userspace how
> long the RDMA is going to be running it should be using FOLL_LONGTERM. The

Yes, we have been pretty consistently deciding that RDMA generally
implies FOLL_LONGTERM. (And furthermore, FOLL_LONGTERM implies
FOLL_PIN--that one is actually enforced by the gup/pup APIs.)


thanks,
Christoph Hellwig Jan. 24, 2023, 5:57 a.m. UTC | #18
On Mon, Jan 23, 2023 at 03:07:48PM -0800, John Hubbard wrote:
> On 1/23/23 08:11, Jan Kara wrote:
> > > For cifs RDMA, do I need to make it pass in FOLL_LONGTERM?  And does that need
> > > a special cleanup?
> > 
> > FOLL_LONGTERM doesn't need a special cleanup AFAIK. It should be used
> > whenever there isn't reasonably bound time after which the page is
> > unpinned. So in case CIFS sets up RDMA and then it is up to userspace how
> > long the RDMA is going to be running it should be using FOLL_LONGTERM. The
> 
> Yes, we have been pretty consistently deciding that RDMA generally
> implies FOLL_LONGTERM. (And furthermore, FOLL_LONGTERM implies
> FOLL_PIN--that one is actually enforced by the gup/pup APIs.)

That's weird.  For storage or file systems, pages are pinnen just as
long when using RDMA as when using local DMA, in fact if you do RDMA
to really fast remote media vs slow local media (e.g. SSD vs disk) you
might pin it shorter when using RDMA.

I think FOLL_LONGTERM makes sense for non-ODP user space memory
registrations for RDMA, which will last basically forever.  It does
not really make much sense at all for in-kernel memory registration for
RDMA that are very short term.
John Hubbard Jan. 24, 2023, 6:55 a.m. UTC | #19
On 1/23/23 21:57, Christoph Hellwig wrote:
> On Mon, Jan 23, 2023 at 03:07:48PM -0800, John Hubbard wrote:
>> On 1/23/23 08:11, Jan Kara wrote:
>>>> For cifs RDMA, do I need to make it pass in FOLL_LONGTERM?  And does that need
>>>> a special cleanup?
>>>
>>> FOLL_LONGTERM doesn't need a special cleanup AFAIK. It should be used
>>> whenever there isn't reasonably bound time after which the page is
>>> unpinned. So in case CIFS sets up RDMA and then it is up to userspace how
>>> long the RDMA is going to be running it should be using FOLL_LONGTERM. The
>>
>> Yes, we have been pretty consistently deciding that RDMA generally
>> implies FOLL_LONGTERM. (And furthermore, FOLL_LONGTERM implies
>> FOLL_PIN--that one is actually enforced by the gup/pup APIs.)
> 
> That's weird.  For storage or file systems, pages are pinnen just as
> long when using RDMA as when using local DMA, in fact if you do RDMA
> to really fast remote media vs slow local media (e.g. SSD vs disk) you
> might pin it shorter when using RDMA.

ah OK, it is possible that I've been thinking too much about the HPC
cases that set up RDMA and leave it there, and not enough about
storage.

> 
> I think FOLL_LONGTERM makes sense for non-ODP user space memory
> registrations for RDMA, which will last basically forever.  It does
> not really make much sense at all for in-kernel memory registration for
> RDMA that are very short term.

Agreed.


thanks,
Al Viro Jan. 26, 2023, 10:15 p.m. UTC | #20
On Mon, Jan 23, 2023 at 02:24:13PM +0100, David Hildenbrand wrote:
> On 23.01.23 14:19, David Howells wrote:
> > David Hildenbrand <david@redhat.com> wrote:
> > 
> > > Switching from FOLL_GET to FOLL_PIN was in the works by John H. Not sure what
> > > the status is. Interestingly, Documentation/core-api/pin_user_pages.rst
> > > already documents that "CASE 1: Direct IO (DIO)" uses FOLL_PIN ... which does,
> > > unfortunately, no reflect reality yet.
> > 
> > Yeah - I just came across that.
> > 
> > Should iov_iter.c then switch entirely to using pin_user_pages(), rather than
> > get_user_pages()?  In which case my patches only need keep track of
> > pinned/not-pinned and never "got".
> 
> That would be the ideal case: whenever intending to access page content, use
> FOLL_PIN instead of FOLL_GET.
> 
> The issue that John was trying to sort out was that there are plenty of
> callsites that do a simple put_page() instead of calling unpin_user_page().
> IIRC, handling that correctly in existing code -- what was pinned must be
> released via unpin_user_page() -- was the biggest workitem.
> 
> Not sure how that relates to your work here (that's why I was asking): if
> you could avoid FOLL_GET, that would be great :)

Take a good look at iter_to_pipe().  It does *not* need to pin anything
(we have an ITER_SOURCE there); with this approach it will.  And it
will stuff those pinned references into a pipe, where they can sit
indefinitely.

IOW, I don't believe it's a usable approach.
David Hildenbrand Jan. 26, 2023, 11:41 p.m. UTC | #21
On 26.01.23 23:15, Al Viro wrote:
> On Mon, Jan 23, 2023 at 02:24:13PM +0100, David Hildenbrand wrote:
>> On 23.01.23 14:19, David Howells wrote:
>>> David Hildenbrand <david@redhat.com> wrote:
>>>
>>>> Switching from FOLL_GET to FOLL_PIN was in the works by John H. Not sure what
>>>> the status is. Interestingly, Documentation/core-api/pin_user_pages.rst
>>>> already documents that "CASE 1: Direct IO (DIO)" uses FOLL_PIN ... which does,
>>>> unfortunately, no reflect reality yet.
>>>
>>> Yeah - I just came across that.
>>>
>>> Should iov_iter.c then switch entirely to using pin_user_pages(), rather than
>>> get_user_pages()?  In which case my patches only need keep track of
>>> pinned/not-pinned and never "got".
>>
>> That would be the ideal case: whenever intending to access page content, use
>> FOLL_PIN instead of FOLL_GET.
>>
>> The issue that John was trying to sort out was that there are plenty of
>> callsites that do a simple put_page() instead of calling unpin_user_page().
>> IIRC, handling that correctly in existing code -- what was pinned must be
>> released via unpin_user_page() -- was the biggest workitem.
>>
>> Not sure how that relates to your work here (that's why I was asking): if
>> you could avoid FOLL_GET, that would be great :)
> 
> Take a good look at iter_to_pipe().  It does *not* need to pin anything
> (we have an ITER_SOURCE there); with this approach it will.  And it
> will stuff those pinned references into a pipe, where they can sit
> indefinitely.
> 
> IOW, I don't believe it's a usable approach.
> 

Not sure what makes you believe that FOLL_GET is any better for this 
long-term pinning, I'd like to learn about that.

As raised already somewhere in the whole discussion by me, the right way 
to take such a long-term ping as vmsplice() does is to use 
FOLL_PIN|FOLL_LONGTERM. As also raised, that will fix the last remaining 
vmsplice()+hugetlb COW issue as tested by the cow.c vm selftest and make 
sure to migrate that memory off of MIGRATE_MOVABLE/CMA memory where we 
cannot tolerate to have long-term unmovable memory sitting around.
David Howells Jan. 27, 2023, 12:05 a.m. UTC | #22
David Hildenbrand <david@redhat.com> wrote:

> As raised already somewhere in the whole discussion by me, the right way to
> take such a long-term ping as vmsplice() does is to use
> FOLL_PIN|FOLL_LONGTERM.

So the pipe infrastructure would have to be able to pin pages instead of
carrying refs on them?  What about pages just allocated and added to the pipe
ring in normal pipe use?

David
David Hildenbrand Jan. 27, 2023, 12:20 a.m. UTC | #23
On 27.01.23 01:05, David Howells wrote:
> David Hildenbrand <david@redhat.com> wrote:
> 
>> As raised already somewhere in the whole discussion by me, the right way to
>> take such a long-term ping as vmsplice() does is to use
>> FOLL_PIN|FOLL_LONGTERM.
> 
> So the pipe infrastructure would have to be able to pin pages instead of
> carrying refs on them?  What about pages just allocated and added to the pipe
> ring in normal pipe use?
Ordinary kernel allocations (alloc_page() ...) always have to be freed 
via put_page() and friends. Such allocations are unmovable as default 
and don't require any special care.

Pages mapped into user space are movable as default and might be placed 
on ZONE_MOVABLE/CMA memory (well, and might be swapped out). 
FOLL_LONGTERM makes sure to migrate these pages off of such problematic 
physical memory regions, such that we can safely pin them until eternity.
diff mbox series

Patch

diff --git a/include/linux/uio.h b/include/linux/uio.h
index 46d5080314c6..a4233049ab7a 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -363,4 +363,32 @@  static inline void iov_iter_ubuf(struct iov_iter *i, unsigned int direction,
 /* Flags for iov_iter_get/extract_pages*() */
 #define ITER_ALLOW_P2PDMA	0x01	/* Allow P2PDMA on the extracted pages */
 
+ssize_t iov_iter_extract_pages(struct iov_iter *i, struct page ***pages,
+			       size_t maxsize, unsigned int maxpages,
+			       unsigned int extract_flags, size_t *offset0);
+
+/**
+ * iov_iter_extract_mode - Indicate how pages from the iterator will be retained
+ * @iter: The iterator
+ * @extract_flags: How the iterator is to be used
+ *
+ * Examine the iterator and @extract_flags and indicate by returning FOLL_PIN,
+ * FOLL_GET or 0 as to how, if at all, pages extracted from the iterator will
+ * be retained by the extraction function.
+ *
+ * FOLL_GET indicates that the pages will have a reference taken on them that
+ * the caller must put.  This can be done for DMA/async DIO write from a page.
+ *
+ * FOLL_PIN indicates that the pages will have a pin placed in them that the
+ * caller must unpin.  This is must be done for DMA/async DIO read to a page to
+ * avoid CoW problems in fork.
+ *
+ * 0 indicates that no measures are taken and that it's up to the caller to
+ * retain the pages.
+ */
+#define iov_iter_extract_mode(iter, extract_flags) \
+	(user_backed_iter(iter) ?				\
+	 (iter->data_source == ITER_SOURCE) ?			\
+	 FOLL_GET : FOLL_PIN : 0)
+
 #endif
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index fb04abe7d746..843abe566efb 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -1916,3 +1916,427 @@  void iov_iter_restore(struct iov_iter *i, struct iov_iter_state *state)
 		i->iov -= state->nr_segs - i->nr_segs;
 	i->nr_segs = state->nr_segs;
 }
+
+/*
+ * Extract a list of contiguous pages from an ITER_PIPE iterator.  This does
+ * not get references of its own on the pages, nor does it get a pin on them.
+ * If there's a partial page, it adds that first and will then allocate and add
+ * pages into the pipe to make up the buffer space to the amount required.
+ *
+ * The caller must hold the pipe locked and only transferring into a pipe is
+ * supported.
+ */
+static ssize_t iov_iter_extract_pipe_pages(struct iov_iter *i,
+					   struct page ***pages, size_t maxsize,
+					   unsigned int maxpages,
+					   unsigned int extract_flags,
+					   size_t *offset0)
+{
+	unsigned int nr, offset, chunk, j;
+	struct page **p;
+	size_t left;
+
+	if (!sanity(i))
+		return -EFAULT;
+
+	offset = pipe_npages(i, &nr);
+	if (!nr)
+		return -EFAULT;
+	*offset0 = offset;
+
+	maxpages = min_t(size_t, nr, maxpages);
+	maxpages = want_pages_array(pages, maxsize, offset, maxpages);
+	if (!maxpages)
+		return -ENOMEM;
+	p = *pages;
+
+	left = maxsize;
+	for (j = 0; j < maxpages; j++) {
+		struct page *page = append_pipe(i, left, &offset);
+		if (!page)
+			break;
+		chunk = min_t(size_t, left, PAGE_SIZE - offset);
+		left -= chunk;
+		*p++ = page;
+	}
+	if (!j)
+		return -EFAULT;
+	return maxsize - left;
+}
+
+/*
+ * Extract a list of contiguous pages from an ITER_XARRAY iterator.  This does not
+ * get references on the pages, nor does it get a pin on them.
+ */
+static ssize_t iov_iter_extract_xarray_pages(struct iov_iter *i,
+					     struct page ***pages, size_t maxsize,
+					     unsigned int maxpages,
+					     unsigned int extract_flags,
+					     size_t *offset0)
+{
+	struct page *page, **p;
+	unsigned int nr = 0, offset;
+	loff_t pos = i->xarray_start + i->iov_offset;
+	pgoff_t index = pos >> PAGE_SHIFT;
+	XA_STATE(xas, i->xarray, index);
+
+	offset = pos & ~PAGE_MASK;
+	*offset0 = offset;
+
+	maxpages = want_pages_array(pages, maxsize, offset, maxpages);
+	if (!maxpages)
+		return -ENOMEM;
+	p = *pages;
+
+	rcu_read_lock();
+	for (page = xas_load(&xas); page; page = xas_next(&xas)) {
+		if (xas_retry(&xas, page))
+			continue;
+
+		/* Has the page moved or been split? */
+		if (unlikely(page != xas_reload(&xas))) {
+			xas_reset(&xas);
+			continue;
+		}
+
+		p[nr++] = find_subpage(page, xas.xa_index);
+		if (nr == maxpages)
+			break;
+	}
+	rcu_read_unlock();
+
+	maxsize = min_t(size_t, nr * PAGE_SIZE - offset, maxsize);
+	iov_iter_advance(i, maxsize);
+	return maxsize;
+}
+
+/*
+ * Extract a list of contiguous pages from an ITER_BVEC iterator.  This does
+ * not get references on the pages, nor does it get a pin on them.
+ */
+static ssize_t iov_iter_extract_bvec_pages(struct iov_iter *i,
+					   struct page ***pages, size_t maxsize,
+					   unsigned int maxpages,
+					   unsigned int extract_flags,
+					   size_t *offset0)
+{
+	struct page **p, *page;
+	size_t skip = i->iov_offset, offset;
+	int k;
+
+	for (;;) {
+		if (i->nr_segs == 0)
+			return 0;
+		maxsize = min(maxsize, i->bvec->bv_len - skip);
+		if (maxsize)
+			break;
+		i->iov_offset = 0;
+		i->nr_segs--;
+		i->kvec++;
+		skip = 0;
+	}
+
+	skip += i->bvec->bv_offset;
+	page = i->bvec->bv_page + skip / PAGE_SIZE;
+	offset = skip % PAGE_SIZE;
+	*offset0 = offset;
+
+	maxpages = want_pages_array(pages, maxsize, offset, maxpages);
+	if (!maxpages)
+		return -ENOMEM;
+	p = *pages;
+	for (k = 0; k < maxpages; k++)
+		p[k] = page + k;
+
+	maxsize = min_t(size_t, maxsize, maxpages * PAGE_SIZE - offset);
+	iov_iter_advance(i, maxsize);
+	return maxsize;
+}
+
+/*
+ * Extract a list of virtually contiguous pages from an ITER_KVEC iterator.
+ * This does not get references on the pages, nor does it get a pin on them.
+ */
+static ssize_t iov_iter_extract_kvec_pages(struct iov_iter *i,
+					   struct page ***pages, size_t maxsize,
+					   unsigned int maxpages,
+					   unsigned int extract_flags,
+					   size_t *offset0)
+{
+	struct page **p, *page;
+	const void *kaddr;
+	size_t skip = i->iov_offset, offset, len;
+	int k;
+
+	for (;;) {
+		if (i->nr_segs == 0)
+			return 0;
+		maxsize = min(maxsize, i->kvec->iov_len - skip);
+		if (maxsize)
+			break;
+		i->iov_offset = 0;
+		i->nr_segs--;
+		i->kvec++;
+		skip = 0;
+	}
+
+	offset = skip % PAGE_SIZE;
+	*offset0 = offset;
+	kaddr = i->kvec->iov_base;
+
+	maxpages = want_pages_array(pages, maxsize, offset, maxpages);
+	if (!maxpages)
+		return -ENOMEM;
+	p = *pages;
+
+	kaddr -= offset;
+	len = offset + maxsize;
+	for (k = 0; k < maxpages; k++) {
+		size_t seg = min_t(size_t, len, PAGE_SIZE);
+
+		if (is_vmalloc_or_module_addr(kaddr))
+			page = vmalloc_to_page(kaddr);
+		else
+			page = virt_to_page(kaddr);
+
+		p[k] = page;
+		len -= seg;
+		kaddr += PAGE_SIZE;
+	}
+
+	maxsize = min_t(size_t, maxsize, maxpages * PAGE_SIZE - offset);
+	iov_iter_advance(i, maxsize);
+	return maxsize;
+}
+
+/*
+ * Get the first segment from an ITER_UBUF or ITER_IOVEC iterator.  The
+ * iterator must not be empty.
+ */
+static unsigned long iov_iter_extract_first_user_segment(const struct iov_iter *i,
+							 size_t *size)
+{
+	size_t skip;
+	long k;
+
+	if (iter_is_ubuf(i))
+		return (unsigned long)i->ubuf + i->iov_offset;
+
+	for (k = 0, skip = i->iov_offset; k < i->nr_segs; k++, skip = 0) {
+		size_t len = i->iov[k].iov_len - skip;
+
+		if (unlikely(!len))
+			continue;
+		if (*size > len)
+			*size = len;
+		return (unsigned long)i->iov[k].iov_base + skip;
+	}
+	BUG(); // if it had been empty, we wouldn't get called
+}
+
+/*
+ * Extract a list of contiguous pages from a user iterator and get references
+ * on them.  This should only be used iff the iterator is user-backed
+ * (IOBUF/UBUF) and data is being transferred out of the buffer described by
+ * the iterator (ie. this is the source).
+ *
+ * The pages are returned with incremented refcounts that the caller must undo
+ * once the transfer is complete, but no additional pins are obtained.
+ *
+ * This is only safe to be used where background IO/DMA is not going to be
+ * modifying the buffer, and so won't cause a problem with CoW on fork.
+ */
+static ssize_t iov_iter_extract_user_pages_and_get(struct iov_iter *i,
+						   struct page ***pages,
+						   size_t maxsize,
+						   unsigned int maxpages,
+						   unsigned int extract_flags,
+						   size_t *offset0)
+{
+	unsigned long addr;
+	unsigned int gup_flags = FOLL_GET;
+	size_t offset;
+	int res;
+
+	if (WARN_ON_ONCE(i->data_source != ITER_SOURCE))
+		return -EFAULT;
+
+	if (extract_flags & ITER_ALLOW_P2PDMA)
+		gup_flags |= FOLL_PCI_P2PDMA;
+	if (i->nofault)
+		gup_flags |= FOLL_NOFAULT;
+
+	addr = iov_iter_extract_first_user_segment(i, &maxsize);
+	*offset0 = offset = addr % PAGE_SIZE;
+	addr &= PAGE_MASK;
+	maxpages = want_pages_array(pages, maxsize, offset, maxpages);
+	if (!maxpages)
+		return -ENOMEM;
+	res = get_user_pages_fast(addr, maxpages, gup_flags, *pages);
+	if (unlikely(res <= 0))
+		return res;
+	maxsize = min_t(size_t, maxsize, res * PAGE_SIZE - offset);
+	iov_iter_advance(i, maxsize);
+	return maxsize;
+}
+
+/*
+ * Extract a list of contiguous pages from a user iterator and get a pin on
+ * each of them.  This should only be used iff the iterator is user-backed
+ * (IOBUF/UBUF) and data is being transferred into the buffer described by the
+ * iterator (ie. this is the destination).
+ *
+ * It does not get refs on the pages, but the pages must be unpinned by the
+ * caller once the transfer is complete.
+ *
+ * This is safe to be used where background IO/DMA *is* going to be modifying
+ * the buffer; using a pin rather than a ref makes sure that CoW happens
+ * correctly in the parent during fork.
+ */
+static ssize_t iov_iter_extract_user_pages_and_pin(struct iov_iter *i,
+						   struct page ***pages,
+						   size_t maxsize,
+						   unsigned int maxpages,
+						   unsigned int extract_flags,
+						   size_t *offset0)
+{
+	unsigned long addr;
+	unsigned int gup_flags = FOLL_PIN | FOLL_WRITE;
+	size_t offset;
+	int res;
+
+	if (WARN_ON_ONCE(i->data_source != ITER_DEST))
+		return -EFAULT;
+
+	if (extract_flags & ITER_ALLOW_P2PDMA)
+		gup_flags |= FOLL_PCI_P2PDMA;
+	if (i->nofault)
+		gup_flags |= FOLL_NOFAULT;
+
+	addr = first_iovec_segment(i, &maxsize);
+	*offset0 = offset = addr % PAGE_SIZE;
+	addr &= PAGE_MASK;
+	maxpages = want_pages_array(pages, maxsize, offset, maxpages);
+	if (!maxpages)
+		return -ENOMEM;
+	res = pin_user_pages_fast(addr, maxpages, gup_flags, *pages);
+	if (unlikely(res <= 0))
+		return res;
+	maxsize = min_t(size_t, maxsize, res * PAGE_SIZE - offset);
+	iov_iter_advance(i, maxsize);
+	return maxsize;
+}
+
+static ssize_t iov_iter_extract_user_pages(struct iov_iter *i,
+					   struct page ***pages, size_t maxsize,
+					   unsigned int maxpages,
+					   unsigned int extract_flags,
+					   size_t *offset0)
+{
+	if (iov_iter_extract_mode(i, extract_flags) == FOLL_GET)
+		return iov_iter_extract_user_pages_and_get(i, pages, maxsize,
+							   maxpages, extract_flags,
+							   offset0);
+	else
+		return iov_iter_extract_user_pages_and_pin(i, pages, maxsize,
+							   maxpages, extract_flags,
+							   offset0);
+}
+
+/**
+ * iov_iter_extract_pages - Extract a list of contiguous pages from an iterator
+ * @i: The iterator to extract from
+ * @pages: Where to return the list of pages
+ * @maxsize: The maximum amount of iterator to extract
+ * @maxpages: The maximum size of the list of pages
+ * @extract_flags: Flags to qualify request
+ * @offset0: Where to return the starting offset into (*@pages)[0]
+ *
+ * Extract a list of contiguous pages from the current point of the iterator,
+ * advancing the iterator.  The maximum number of pages and the maximum amount
+ * of page contents can be set.
+ *
+ * If *@pages is NULL, a page list will be allocated to the required size and
+ * *@pages will be set to its base.  If *@pages is not NULL, it will be assumed
+ * that the caller allocated a page list at least @maxpages in size and this
+ * will be filled in.
+ *
+ * @extract_flags can have ITER_ALLOW_P2PDMA set to request peer-to-peer DMA be
+ * allowed on the pages extracted.
+ *
+ * The iov_iter_extract_mode() function can be used to query how cleanup should
+ * be performed.
+ *
+ * Extra refs or pins on the pages may be obtained as follows:
+ *
+ *  (*) If the iterator is user-backed (ITER_IOVEC/ITER_UBUF) and data is to be
+ *      transferred /OUT OF/ the buffer (@i->data_source == ITER_SOURCE), refs
+ *      will be taken on the pages, but pins will not be added.  This can be
+ *      used for DMA from a page; it cannot be used for DMA to a page, as it
+ *      may cause page-COW problems in fork.  iov_iter_extract_mode() will
+ *      return FOLL_GET.
+ *
+ *  (*) If the iterator is user-backed (ITER_IOVEC/ITER_UBUF) and data is to be
+ *      transferred /INTO/ the described buffer (@i->data_source |= ITER_DEST),
+ *      pins will be added to the pages, but refs will not be taken.  This must
+ *      be used for DMA to a page.  iov_iter_extract_mode() will return
+ *      FOLL_PIN.
+ *
+ *  (*) If the iterator is ITER_PIPE, this must describe a destination for the
+ *      data.  Additional pages may be allocated and added to the pipe (which
+ *      will hold the refs), but neither refs nor pins will be obtained for the
+ *      caller.  The caller must hold the pipe lock.  iov_iter_extract_mode()
+ *      will return 0.
+ *
+ *  (*) If the iterator is ITER_KVEC, ITER_BVEC or ITER_XARRAY, the pages are
+ *      merely listed; no extra refs or pins are obtained.
+ *      iov_iter_extract_mode() will return 0.
+ *
+ * Note also:
+ *
+ *  (*) Peer-to-peer DMA (ITER_ALLOW_P2PDMA) is only permitted with user-backed
+ *      iterators.
+ *
+ *  (*) Use with ITER_DISCARD is not supported as that has no content.
+ *
+ * On success, the function sets *@pages to the new pagelist, if allocated, and
+ * sets *offset0 to the offset into the first page..
+ *
+ * It may also return -ENOMEM and -EFAULT.
+ */
+ssize_t iov_iter_extract_pages(struct iov_iter *i,
+			       struct page ***pages,
+			       size_t maxsize,
+			       unsigned int maxpages,
+			       unsigned int extract_flags,
+			       size_t *offset0)
+{
+	maxsize = min_t(size_t, min_t(size_t, maxsize, i->count), MAX_RW_COUNT);
+	if (!maxsize)
+		return 0;
+
+	if (likely(user_backed_iter(i)))
+		return iov_iter_extract_user_pages(i, pages, maxsize,
+						   maxpages, extract_flags,
+						   offset0);
+	if (WARN_ON_ONCE(extract_flags & ITER_ALLOW_P2PDMA))
+		return -EIO;
+	if (iov_iter_is_kvec(i))
+		return iov_iter_extract_kvec_pages(i, pages, maxsize,
+						   maxpages, extract_flags,
+						   offset0);
+	if (iov_iter_is_bvec(i))
+		return iov_iter_extract_bvec_pages(i, pages, maxsize,
+						   maxpages, extract_flags,
+						   offset0);
+	if (iov_iter_is_pipe(i))
+		return iov_iter_extract_pipe_pages(i, pages, maxsize,
+						   maxpages, extract_flags,
+						   offset0);
+	if (iov_iter_is_xarray(i))
+		return iov_iter_extract_xarray_pages(i, pages, maxsize,
+						     maxpages, extract_flags,
+						     offset0);
+	return -EFAULT;
+}
+EXPORT_SYMBOL_GPL(iov_iter_extract_pages);