[04/10] iov_iter: Add mapping and discard iterator types

Message ID 153685392942.14766.3347355712333618914.stgit@warthog.procyon.org.uk (mailing list archive)
State New, archived
Series: iov_iter: Add new iters and use with AFS

Commit Message

David Howells Sept. 13, 2018, 3:52 p.m. UTC
Add two new iterator types to iov_iter:

 (1) ITER_MAPPING

     This walks through a set of pages attached to an address_space that
     are pinned or locked, starting at a given page and offset and walking
     for the specified amount of space.  A facility to get a callback each
     time a page is entirely processed is provided.

     This is useful for copying data from socket buffers to inodes in
     network filesystems.

 (2) ITER_DISCARD

     This is a sink iterator that can only be used in READ mode and just
     discards any data copied to it.

     This is useful in a network filesystem for discarding any unwanted
     data sent by a server.
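
As an illustration, here is a minimal, hypothetical sketch of how a network
filesystem might use the two iterators (the fs_fetch_data() helper and the
surrounding variables are invented; the pages covering the region must
already be pinned or locked by the caller):

	struct iov_iter iter;

	/* Pull an extent of the file from the server directly into the
	 * pagecache pages covering it, which the caller has locked.
	 */
	iov_iter_mapping(&iter, READ, inode->i_mapping, pos, len);
	ret = fs_fetch_data(call, &iter);

	/* Discard an unwanted trailer the server appended to the reply. */
	iov_iter_discard(&iter, READ, trailer_len);
	ret = fs_fetch_data(call, &iter);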

Signed-off-by: David Howells <dhowells@redhat.com>
---

 include/linux/uio.h |    8 +
 lib/iov_iter.c      |  365 +++++++++++++++++++++++++++++++++++++++++++++------
 2 files changed, 330 insertions(+), 43 deletions(-)

Comments

Al Viro Sept. 14, 2018, 4:18 a.m. UTC | #1
On Thu, Sep 13, 2018 at 04:52:09PM +0100, David Howells wrote:
> Add two new iterator types to iov_iter:
> 
>  (1) ITER_MAPPING
> 
>      This walks through a set of pages attached to an address_space that
>      are pinned or locked, starting at a given page and offset and walking
>      for the specified amount of space.  A facility to get a callback each
>      time a page is entirely processed is provided.
> 
>      This is useful for copying data from socket buffers to inodes in
>      network filesystems.

Interesting...  Questions:
	* what will hold those pages?  IOW, where will you unlock/drop/whatnot
those suckers?
	* "callback" sounds dangerous - it appears to imply that you won't
copy to/from the same page twice.  Not true for a lot of iov_iter users; what
happens if you pass such a beast to them?
	* why not simply "build and populate ITER_BVEC aliasing a piece of mapping",
possibly in "grab" and "grab+lock" variants?  Those ITER_MAPPING do seem to be
related to ITER_BVEC, at the very least.  Note, BTW, that iov_iter_get_pages...()
might mutate into something similar - "build and populate ITER_BVEC aliasing a piece
of given iov_iter".  Or, perhaps, a nicer-on-memory analogue of ITER_BVEC -
with <offset, bytes, pointer to pages array> instead of <offset, bytes, page> as
elements, with the same "populate from mapping" to get something similar to your
functionality and "populate from iov_iter" for iov_iter_get_pages... replacement.
Trond Myklebust Sept. 14, 2018, 12:57 p.m. UTC | #2
On Fri, 2018-09-14 at 05:18 +0100, Al Viro wrote:
> On Thu, Sep 13, 2018 at 04:52:09PM +0100, David Howells wrote:
> > Add two new iterator types to iov_iter:
> > 
> >  (1) ITER_MAPPING
> > 
> >      This walks through a set of pages attached to an address_space that
> >      are pinned or locked, starting at a given page and offset and walking
> >      for the specified amount of space.  A facility to get a callback each
> >      time a page is entirely processed is provided.
> > 
> >      This is useful for copying data from socket buffers to inodes in
> >      network filesystems.
> 
> Interesting...  Questions:
> 	* what will hold those pages?  IOW, where will you unlock/drop/whatnot
> those suckers?
> 	* "callback" sounds dangerous - it appears to imply that you won't
> copy to/from the same page twice.  Not true for a lot of iov_iter users; what
> happens if you pass such a beast to them?
> 	* why not simply "build and populate ITER_BVEC aliasing a piece of mapping",
> possibly in "grab" and "grab+lock" variants?  Those ITER_MAPPING do seem to be
> related to ITER_BVEC, at the very least.  Note, BTW, that iov_iter_get_pages...()
> might mutate into something similar - "build and populate ITER_BVEC aliasing a piece
> of given iov_iter".  Or, perhaps, a nicer-on-memory analogue of ITER_BVEC -
> with <offset, bytes, pointer to pages array> instead of <offset, bytes, page> as
> elements, with the same "populate from mapping" to get something similar to your
> functionality and "populate from iov_iter" for iov_iter_get_pages... replacement.

Another question that is relevant for most networked filesystems
(including AFS, I believe), is how will you deal with encryption of the
data you are transmitting? Encrypting and decrypting in-place directly
in the page cache or in a userspace O_DIRECT mapped buffer might not be
the best and most secure option, so won't you find yourself wanting to
copy the data anyway?
David Howells Sept. 17, 2018, 8:58 p.m. UTC | #3
Al Viro <viro@ZenIV.linux.org.uk> wrote:

> > Add two new iterator types to iov_iter:
> > 
> >  (1) ITER_MAPPING
> > 
> >      This walks through a set of pages attached to an address_space that
> >      are pinned or locked, starting at a given page and offset and walking
> >      for the specified amount of space.  A facility to get a callback each
> >      time a page is entirely processed is provided.
> > 
> >      This is useful for copying data from socket buffers to inodes in
> >      network filesystems.
> 
> Interesting...  Questions:
> 	* what will hold those pages?  IOW, where will you unlock/drop/whatnot
> those suckers?

The caller needs to have those pages pinned - say with PG_locked or
PG_writeback.  Sorry - I mentioned this in the cover letter, but not here.

You can either undo those changes in the callback or upon completion of the
iteration.

> 	* "callback" sounds dangerous - it appears to imply that you won't
> copy to/from the same page twice.  Not true for a lot of iov_iter users; what
> happens if you pass such a beast to them?

Similar to ITER_PIPE.  There's no rewind.  Once you've passed a page, it's
gone.  Under what circumstances would you want to copy to/from the same page
twice?

> 	* why not simply "build and populate ITER_BVEC aliasing a piece of
> mapping", possibly in "grab" and "grab+lock" variants?

ITER_BVEC is inefficient.  That's what the code upstream currently uses - see
afs_load_bvec() - and there's a practical limit to the number of pages I can
shovel into one in one go.

Further, every time I reach the end of an ITER_BVEC, I have to return to
process context, which then has to round up the next bundle of pages by
calling into the radix tree.  It seems to work out better to put the radix
iteration into the iterator if we can.  The caller guarantees that the
contents of the region of interest are (a) fully populated and (b) pinned.

Yet further, with ITER_BVEC, I can't release any of the pinned pages until the
entire iteration is finished.  That means if I have a 4GB BVEC, those pages
are going to be pinned for a long time.  With ITER_MAPPING, they're released
incrementally via the callback.
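
As a purely illustrative sketch (this helper isn't part of the patch), a
filesystem writing pages back to the server might release them incrementally
with something like:

	static void afs_page_written(const struct iov_iter *iter,
				     const struct bio_vec *bv)
	{
		/* The iterator is done with this page: drop our pin on it. */
		end_page_writeback(bv->bv_page);
	}

	...
	iov_iter_mapping(&iter, WRITE, mapping, pos, len);
	iter.page_done = afs_page_written;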

> Those ITER_MAPPING do seem to be related to ITER_BVEC, at the very least.

Only in the sense that the current position can be described by the same three
numbers: page, len, offset.  I'm reusing struct bio_vec so that I can share
some of the code with ITER_BVEC.
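
(For reference, struct bio_vec is exactly that triple:

	struct bio_vec {
		struct page	*bv_page;
		unsigned int	bv_len;
		unsigned int	bv_offset;
	};

so an ITER_MAPPING position maps onto it directly.)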

> Note, BTW, that iov_iter_get_pages...() might mutate into something similar
> - "build and populate ITER_BVEC aliasing a piece of given iov_iter".  Or,
> perhaps, a nicer-on-memory analogue of ITER_BVEC - with <offset, bytes,
> pointer to pages array> instead of <offset, bytes, page> as elements, with
> the same "populate from mapping" to get something similar to your
> functionality and "populate from iov_iter" for
> iov_iter_get_pages... replacement

The whole point is to avoid having to use ITER_BVEC.  ITER_BVEC has a number
of issues that ITER_MAPPING overcomes - though ITER_MAPPING can only be used
with a mapping (or, at least, a radix tree).

There is no point in a loop that shifts runs of pages into a page array for
use with a BVEC when the mapping already carries the same information.  You
save memory and processing time.
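
To make that concrete, this is roughly the sort of loop (modelled loosely on
afs_load_bvec(); the surrounding variables are invented here) that
ITER_MAPPING makes unnecessary:

	struct bio_vec bv[8];		/* practical per-batch limit */
	struct page *pages[8];
	struct iov_iter iter;
	unsigned int nr, i;
	size_t count = 0;

	/* Copy the radix-tree information into a temporary bvec array
	 * just so that the iterator can walk it.
	 */
	nr = find_get_pages_contig(mapping, index, ARRAY_SIZE(pages), pages);
	for (i = 0; i < nr; i++) {
		bv[i].bv_page	= pages[i];
		bv[i].bv_offset	= i == 0 ? offset : 0;
		bv[i].bv_len	= min_t(size_t, len - count,
					PAGE_SIZE - bv[i].bv_offset);
		count += bv[i].bv_len;
	}
	iov_iter_bvec(&iter, WRITE, bv, nr, count);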

David
David Howells Sept. 17, 2018, 9:32 p.m. UTC | #4
Trond Myklebust <trondmy@hammerspace.com> wrote:

> Another question that is relevant for most networked filesystems
> (including AFS, I believe), is how will you deal with encryption of the
> data you are transmitting? Encrypting and decrypting in-place directly
> in the page cache or in a userspace O_DIRECT mapped buffer might not be
> the best and most secure option, so won't you find yourself wanting to
> copy the data anyway?

For kAFS, the interface between kAFS and AF_RXRPC takes an iterator.

Currently, encryption is done in place on the sk_buffs inside AF_RXRPC, but
the goal I have in mind is to use the crypto operation to replace the copy
between sk_buff and buffer.  This is tricky, however, as the encrypted payload
contains metadata as well as data and on reception I have to read the metadata
to find out how much data there actually is.

David

Patch

diff --git a/include/linux/uio.h b/include/linux/uio.h
index 1e03cb50a0e0..1ecb96614a40 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -14,6 +14,7 @@ 
 #include <uapi/linux/uio.h>
 
 struct page;
+struct address_space;
 struct pipe_inode_info;
 
 struct kvec {
@@ -26,6 +27,8 @@  enum iter_type {
 	ITER_KVEC,
 	ITER_BVEC,
 	ITER_PIPE,
+	ITER_DISCARD,
+	ITER_MAPPING,
 };
 
 struct iov_iter {
@@ -37,6 +40,7 @@  struct iov_iter {
 		const struct iovec *iov;
 		const struct kvec *kvec;
 		const struct bio_vec *bvec;
+		struct address_space *mapping;
 		struct pipe_inode_info *pipe;
 	};
 	union {
@@ -45,6 +49,7 @@  struct iov_iter {
 			int idx;
 			int start_idx;
 		};
+		void (*page_done)(const struct iov_iter *, const struct bio_vec *);
 	};
 };
 
@@ -211,6 +216,9 @@  void iov_iter_bvec(struct iov_iter *i, unsigned int direction, const struct bio_
 			unsigned long nr_segs, size_t count);
 void iov_iter_pipe(struct iov_iter *i, unsigned int direction, struct pipe_inode_info *pipe,
 			size_t count);
+void iov_iter_mapping(struct iov_iter *i, unsigned int direction, struct address_space *mapping,
+		      loff_t start, size_t count);
+void iov_iter_discard(struct iov_iter *i, unsigned int direction, size_t count);
 ssize_t iov_iter_get_pages(struct iov_iter *i, struct page **pages,
 			size_t maxsize, unsigned maxpages, size_t *start);
 ssize_t iov_iter_get_pages_alloc(struct iov_iter *i, struct page ***pages,
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index 8231f0e38f20..22b35464891b 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -72,7 +72,35 @@ 
 	}						\
 }
 
-#define iterate_all_kinds(i, n, v, I, B, K) {			\
+#define iterate_mapping(i, n, __v, do_done, skip, STEP) {	\
+	struct radix_tree_iter cursor;				\
+	size_t wanted = n, seg, offset;				\
+	pgoff_t index = skip >> PAGE_SHIFT;			\
+	void __rcu **slot;					\
+								\
+	rcu_read_lock();					\
+	radix_tree_for_each_contig(slot, &i->mapping->i_pages,	\
+				   &cursor, index) {		\
+		if (!n)						\
+			break;					\
+		__v.bv_page = radix_tree_deref_slot(slot);	\
+		if (!__v.bv_page)				\
+			break;					\
+		offset = skip & ~PAGE_MASK;			\
+		seg = PAGE_SIZE - offset;			\
+		__v.bv_offset = offset;				\
+		__v.bv_len = min(n, seg);			\
+		(void)(STEP);					\
+		if (do_done && __v.bv_offset + __v.bv_len == PAGE_SIZE)	\
+			i->page_done(i, &__v);			\
+		n -= __v.bv_len;				\
+		skip += __v.bv_len;				\
+	}							\
+	rcu_read_unlock();					\
+	n = wanted - n;						\
+}
+
+#define iterate_all_kinds(i, n, v, I, B, K, M) {		\
 	if (likely(n)) {					\
 		loff_t skip = i->iov_offset;			\
 		switch (iov_iter_type(i)) {			\
@@ -91,6 +119,14 @@ 
 		case ITER_PIPE: {				\
 			break;					\
 		}						\
+		case ITER_MAPPING: {				\
+			struct bio_vec v;			\
+			iterate_mapping(i, n, v, false, skip, (M));	\
+			break;					\
+		}						\
+		case ITER_DISCARD: {				\
+			break;					\
+		}						\
 		case ITER_IOVEC: {				\
 			const struct iovec *iov;		\
 			struct iovec v;				\
@@ -101,7 +137,7 @@ 
 	}							\
 }
 
-#define iterate_and_advance(i, n, v, I, B, K) {			\
+#define iterate_and_advance(i, n, v, I, B, K, M) {		\
 	if (unlikely(i->count < n))				\
 		n = i->count;					\
 	if (i->count) {						\
@@ -129,6 +165,11 @@ 
 			i->kvec = kvec;				\
 			break;					\
 		}						\
+		case ITER_MAPPING: {				\
+			struct bio_vec v;			\
+			iterate_mapping(i, n, v, i->page_done, skip, (M));	\
+			break;					\
+		}						\
 		case ITER_IOVEC: {				\
 			const struct iovec *iov;		\
 			struct iovec v;				\
@@ -144,6 +185,10 @@ 
 		case ITER_PIPE: {				\
 			break;					\
 		}						\
+		case ITER_DISCARD: {				\
+			skip += n;				\
+			break;					\
+		}						\
 		}						\
 		i->count -= n;					\
 		i->iov_offset = skip;				\
@@ -448,6 +493,8 @@  int iov_iter_fault_in_readable(struct iov_iter *i, size_t bytes)
 		break;
 	case ITER_KVEC:
 	case ITER_BVEC:
+	case ITER_MAPPING:
+	case ITER_DISCARD:
 		break;
 	}
 	return 0;
@@ -593,7 +640,9 @@  size_t _copy_to_iter(const void *addr, size_t bytes, struct iov_iter *i)
 		copyout(v.iov_base, (from += v.iov_len) - v.iov_len, v.iov_len),
 		memcpy_to_page(v.bv_page, v.bv_offset,
 			       (from += v.bv_len) - v.bv_len, v.bv_len),
-		memcpy(v.iov_base, (from += v.iov_len) - v.iov_len, v.iov_len)
+		memcpy(v.iov_base, (from += v.iov_len) - v.iov_len, v.iov_len),
+		memcpy_to_page(v.bv_page, v.bv_offset,
+			       (from += v.bv_len) - v.bv_len, v.bv_len)
 	)
 
 	return bytes;
@@ -708,6 +757,15 @@  size_t _copy_to_iter_mcsafe(const void *addr, size_t bytes, struct iov_iter *i)
 			bytes = curr_addr - s_addr - rem;
 			return bytes;
 		}
+		}),
+		({
+		rem = memcpy_mcsafe_to_page(v.bv_page, v.bv_offset,
+                               (from += v.bv_len) - v.bv_len, v.bv_len);
+		if (rem) {
+			curr_addr = (unsigned long) from;
+			bytes = curr_addr - s_addr - rem;
+			return bytes;
+		}
 		})
 	)
 
@@ -729,7 +787,9 @@  size_t _copy_from_iter(void *addr, size_t bytes, struct iov_iter *i)
 		copyin((to += v.iov_len) - v.iov_len, v.iov_base, v.iov_len),
 		memcpy_from_page((to += v.bv_len) - v.bv_len, v.bv_page,
 				 v.bv_offset, v.bv_len),
-		memcpy((to += v.iov_len) - v.iov_len, v.iov_base, v.iov_len)
+		memcpy((to += v.iov_len) - v.iov_len, v.iov_base, v.iov_len),
+		memcpy_from_page((to += v.bv_len) - v.bv_len, v.bv_page,
+				 v.bv_offset, v.bv_len)
 	)
 
 	return bytes;
@@ -755,7 +815,9 @@  bool _copy_from_iter_full(void *addr, size_t bytes, struct iov_iter *i)
 		0;}),
 		memcpy_from_page((to += v.bv_len) - v.bv_len, v.bv_page,
 				 v.bv_offset, v.bv_len),
-		memcpy((to += v.iov_len) - v.iov_len, v.iov_base, v.iov_len)
+		memcpy((to += v.iov_len) - v.iov_len, v.iov_base, v.iov_len),
+		memcpy_from_page((to += v.bv_len) - v.bv_len, v.bv_page,
+				 v.bv_offset, v.bv_len)
 	)
 
 	iov_iter_advance(i, bytes);
@@ -775,7 +837,9 @@  size_t _copy_from_iter_nocache(void *addr, size_t bytes, struct iov_iter *i)
 					 v.iov_base, v.iov_len),
 		memcpy_from_page((to += v.bv_len) - v.bv_len, v.bv_page,
 				 v.bv_offset, v.bv_len),
-		memcpy((to += v.iov_len) - v.iov_len, v.iov_base, v.iov_len)
+		memcpy((to += v.iov_len) - v.iov_len, v.iov_base, v.iov_len),
+		memcpy_from_page((to += v.bv_len) - v.bv_len, v.bv_page,
+				 v.bv_offset, v.bv_len)
 	)
 
 	return bytes;
@@ -810,7 +874,9 @@  size_t _copy_from_iter_flushcache(void *addr, size_t bytes, struct iov_iter *i)
 		memcpy_page_flushcache((to += v.bv_len) - v.bv_len, v.bv_page,
 				 v.bv_offset, v.bv_len),
 		memcpy_flushcache((to += v.iov_len) - v.iov_len, v.iov_base,
-			v.iov_len)
+			v.iov_len),
+		memcpy_page_flushcache((to += v.bv_len) - v.bv_len, v.bv_page,
+				 v.bv_offset, v.bv_len)
 	)
 
 	return bytes;
@@ -834,7 +900,9 @@  bool _copy_from_iter_full_nocache(void *addr, size_t bytes, struct iov_iter *i)
 		0;}),
 		memcpy_from_page((to += v.bv_len) - v.bv_len, v.bv_page,
 				 v.bv_offset, v.bv_len),
-		memcpy((to += v.iov_len) - v.iov_len, v.iov_base, v.iov_len)
+		memcpy((to += v.iov_len) - v.iov_len, v.iov_base, v.iov_len),
+		memcpy_from_page((to += v.bv_len) - v.bv_len, v.bv_page,
+				 v.bv_offset, v.bv_len)
 	)
 
 	iov_iter_advance(i, bytes);
@@ -860,7 +928,8 @@  size_t copy_page_to_iter(struct page *page, size_t offset, size_t bytes,
 		return 0;
 	switch (iov_iter_type(i)) {
 	case ITER_BVEC:
-	case ITER_KVEC: {
+	case ITER_KVEC:
+	case ITER_MAPPING: {
 		void *kaddr = kmap_atomic(page);
 		size_t wanted = copy_to_iter(kaddr + offset, bytes, i);
 		kunmap_atomic(kaddr);
@@ -870,6 +939,8 @@  size_t copy_page_to_iter(struct page *page, size_t offset, size_t bytes,
 		return copy_page_to_iter_iovec(page, offset, bytes, i);
 	case ITER_PIPE:
 		return copy_page_to_iter_pipe(page, offset, bytes, i);
+	case ITER_DISCARD:
+		return bytes;
 	}
 	BUG();
 }
@@ -882,7 +953,9 @@  size_t copy_page_from_iter(struct page *page, size_t offset, size_t bytes,
 		return 0;
 	switch (iov_iter_type(i)) {
 	case ITER_PIPE:
+	case ITER_DISCARD:
 		break;
+	case ITER_MAPPING:
 	case ITER_BVEC:
 	case ITER_KVEC: {
 		void *kaddr = kmap_atomic(page);
@@ -930,7 +1003,8 @@  size_t iov_iter_zero(size_t bytes, struct iov_iter *i)
 	iterate_and_advance(i, bytes, v,
 		clear_user(v.iov_base, v.iov_len),
 		memzero_page(v.bv_page, v.bv_offset, v.bv_len),
-		memset(v.iov_base, 0, v.iov_len)
+		memset(v.iov_base, 0, v.iov_len),
+		memzero_page(v.bv_page, v.bv_offset, v.bv_len)
 	)
 
 	return bytes;
@@ -945,7 +1019,7 @@  size_t iov_iter_copy_from_user_atomic(struct page *page,
 		kunmap_atomic(kaddr);
 		return 0;
 	}
-	if (unlikely(iov_iter_is_pipe(i))) {
+	if (unlikely(iov_iter_is_pipe(i) || iov_iter_type(i) == ITER_DISCARD)) {
 		kunmap_atomic(kaddr);
 		WARN_ON(1);
 		return 0;
@@ -954,7 +1028,9 @@  size_t iov_iter_copy_from_user_atomic(struct page *page,
 		copyin((p += v.iov_len) - v.iov_len, v.iov_base, v.iov_len),
 		memcpy_from_page((p += v.bv_len) - v.bv_len, v.bv_page,
 				 v.bv_offset, v.bv_len),
-		memcpy((p += v.iov_len) - v.iov_len, v.iov_base, v.iov_len)
+		memcpy((p += v.iov_len) - v.iov_len, v.iov_base, v.iov_len),
+		memcpy_from_page((p += v.bv_len) - v.bv_len, v.bv_page,
+				 v.bv_offset, v.bv_len)
 	)
 	kunmap_atomic(kaddr);
 	return bytes;
@@ -1016,7 +1092,14 @@  void iov_iter_advance(struct iov_iter *i, size_t size)
 	case ITER_IOVEC:
 	case ITER_KVEC:
 	case ITER_BVEC:
-		iterate_and_advance(i, size, v, 0, 0, 0);
+		iterate_and_advance(i, size, v, 0, 0, 0, 0);
+		return;
+	case ITER_MAPPING:
+		/* We really don't want to fetch pages if we can avoid it */
+		i->iov_offset += size;
+		/* Fall through */
+	case ITER_DISCARD:
+		i->count -= size;
 		return;
 	}
 	BUG();
@@ -1060,6 +1143,14 @@  void iov_iter_revert(struct iov_iter *i, size_t unroll)
 	}
 	unroll -= i->iov_offset;
 	switch (iov_iter_type(i)) {
+	case ITER_MAPPING:
+		BUG(); /* We should never go beyond the start of the mapping
+			* since iov_offset includes that page number as well as
+			* the in-page offset.
+			*/
+	case ITER_DISCARD:
+		i->iov_offset = 0;
+		return;
 	case ITER_BVEC: {
 		const struct bio_vec *bvec = i->bvec;
 		while (1) {
@@ -1103,6 +1194,8 @@  size_t iov_iter_single_seg_count(const struct iov_iter *i)
 		return i->count;
 	switch (iov_iter_type(i)) {
 	case ITER_PIPE:
+	case ITER_DISCARD:
+	case ITER_MAPPING:
 		return i->count;	// it is a silly place, anyway
 	case ITER_BVEC:
 		return min_t(size_t, i->count, i->bvec->bv_len - i->iov_offset);
@@ -1158,6 +1251,52 @@  void iov_iter_pipe(struct iov_iter *i, unsigned int direction,
 }
 EXPORT_SYMBOL(iov_iter_pipe);
 
+/**
+ * iov_iter_mapping - Initialise an I/O iterator to use the pages in a mapping
+ * @i: The iterator to initialise.
+ * @direction: The direction of the transfer.
+ * @mapping: The mapping to access.
+ * @start: The start file position.
+ * @count: The size of the I/O buffer in bytes.
+ *
+ * Set up an I/O iterator to either draw data out of the pages attached to an
+ * inode or to inject data into those pages.  The caller *must* prevent the
+ * pages from evaporating, either by taking a ref on them or by locking them.
+ */
+void iov_iter_mapping(struct iov_iter *i, unsigned int direction,
+		      struct address_space *mapping,
+		      loff_t start, size_t count)
+{
+	BUG_ON(direction & ~1);
+	i->iter_dir = direction;
+	i->iter_type = ITER_MAPPING;
+	i->mapping = mapping;
+	i->count = count;
+	i->iov_offset = start;
+	i->page_done = NULL;
+}
+EXPORT_SYMBOL(iov_iter_mapping);
+
+/**
+ * iov_iter_discard - Initialise an I/O iterator that discards data
+ * @i: The iterator to initialise.
+ * @direction: The direction of the transfer.
+ * @count: The size of the I/O buffer in bytes.
+ *
+ * Set up an I/O iterator that just discards everything that's written to it.
+ * It's only available as a READ iterator.
+ */
+void iov_iter_discard(struct iov_iter *i, unsigned int direction, size_t count)
+{
+	BUG_ON(direction != READ);
+	i->iter_dir = READ;
+	i->iter_type = ITER_DISCARD;
+	i->count = count;
+	i->iov_offset = 0;
+}
+EXPORT_SYMBOL(iov_iter_discard);
+
 unsigned long iov_iter_alignment(const struct iov_iter *i)
 {
 	unsigned long res = 0;
@@ -1171,7 +1310,8 @@  unsigned long iov_iter_alignment(const struct iov_iter *i)
 	iterate_all_kinds(i, size, v,
 		(res |= (unsigned long)v.iov_base | v.iov_len, 0),
 		res |= v.bv_offset | v.bv_len,
-		res |= (unsigned long)v.iov_base | v.iov_len
+		res |= (unsigned long)v.iov_base | v.iov_len,
+		res |= v.bv_offset | v.bv_len
 	)
 	return res;
 }
@@ -1182,7 +1322,7 @@  unsigned long iov_iter_gap_alignment(const struct iov_iter *i)
 	unsigned long res = 0;
 	size_t size = i->count;
 
-	if (unlikely(iov_iter_is_pipe(i))) {
+	if (unlikely(iov_iter_is_pipe(i) || iov_iter_type(i) == ITER_DISCARD)) {
 		WARN_ON(1);
 		return ~0U;
 	}
@@ -1193,7 +1333,9 @@  unsigned long iov_iter_gap_alignment(const struct iov_iter *i)
 		(res |= (!res ? 0 : (unsigned long)v.bv_offset) |
 			(size != v.bv_len ? size : 0)),
 		(res |= (!res ? 0 : (unsigned long)v.iov_base) |
-			(size != v.iov_len ? size : 0))
+			(size != v.iov_len ? size : 0)),
+		(res |= (!res ? 0 : (unsigned long)v.bv_offset) |
+			(size != v.bv_len ? size : 0))
 		);
 	return res;
 }
@@ -1243,6 +1385,43 @@  static ssize_t pipe_get_pages(struct iov_iter *i,
 	return __pipe_get_pages(i, min(maxsize, capacity), pages, idx, start);
 }
 
+static ssize_t iter_mapping_get_pages(struct iov_iter *i,
+				      struct page **pages, size_t maxsize,
+				      unsigned maxpages, size_t *start)
+{
+	unsigned nr, offset;
+	pgoff_t index, count;
+	size_t size = maxsize;
+
+	if (!size || !maxpages)
+		return 0;
+
+	index = i->iov_offset >> PAGE_SHIFT;
+	offset = i->iov_offset & ~PAGE_MASK;
+	*start = offset;
+
+	count = 1;
+	if (size > PAGE_SIZE - offset) {
+		size -= PAGE_SIZE - offset;
+		count += size >> PAGE_SHIFT;
+		size &= ~PAGE_MASK;
+		if (size)
+			count++;
+	}
+
+	if (count > maxpages)
+		count = maxpages;
+
+	nr = find_get_pages_contig(i->mapping, index, count, pages);
+	if (nr == count)
+		return maxsize;
+	if (nr == 0)
+		return 0;
+	if (nr == 1)
+		return PAGE_SIZE - offset;
+	return (PAGE_SIZE - offset) + (nr - 1) * PAGE_SIZE;
+}
+
 ssize_t iov_iter_get_pages(struct iov_iter *i,
 		   struct page **pages, size_t maxsize, unsigned maxpages,
 		   size_t *start)
@@ -1253,6 +1432,9 @@  ssize_t iov_iter_get_pages(struct iov_iter *i,
 	switch (iov_iter_type(i)) {
 	case ITER_PIPE:
 		return pipe_get_pages(i, pages, maxsize, maxpages, start);
+	case ITER_MAPPING:
+		return iter_mapping_get_pages(i, pages, maxsize, maxpages, start);
+	case ITER_DISCARD:
 	case ITER_KVEC:
 		return -EFAULT;
 	case ITER_IOVEC:
@@ -1279,9 +1461,7 @@  ssize_t iov_iter_get_pages(struct iov_iter *i,
 		*start = v.bv_offset;
 		get_page(*pages = v.bv_page);
 		return v.bv_len;
-	}),({
-		return -EFAULT;
-	})
+	}), 0, 0
 	)
 	return 0;
 }
@@ -1326,6 +1506,48 @@  static ssize_t pipe_get_pages_alloc(struct iov_iter *i,
 	return n;
 }
 
+static ssize_t iter_mapping_get_pages_alloc(struct iov_iter *i,
+					    struct page ***pages, size_t maxsize,
+					    size_t *start)
+{
+	struct page **p;
+	unsigned nr, offset;
+	pgoff_t index, count;
+	size_t size = maxsize;
+
+	if (!size)
+		return 0;
+
+	index = i->iov_offset >> PAGE_SHIFT;
+	offset = i->iov_offset & ~PAGE_MASK;
+	*start = offset;
+
+	count = 1;
+	if (size > PAGE_SIZE - offset) {
+		size -= PAGE_SIZE - offset;
+		count += size >> PAGE_SHIFT;
+		size &= ~PAGE_MASK;
+		if (size)
+			count++;
+	}
+
+	p = get_pages_array(count);
+	if (!p)
+		return -ENOMEM;
+	*pages = p;
+
+	nr = find_get_pages_contig(i->mapping, index, count, p);
+	if (nr == count)
+		return maxsize;
+	if (nr == 0) {
+		kvfree(p);
+		return 0;
+	}
+	if (nr == 1)
+		return PAGE_SIZE - offset;
+	return (PAGE_SIZE - offset) + (nr - 1) * PAGE_SIZE;
+}
+
 ssize_t iov_iter_get_pages_alloc(struct iov_iter *i,
 		   struct page ***pages, size_t maxsize,
 		   size_t *start)
@@ -1338,6 +1560,9 @@  ssize_t iov_iter_get_pages_alloc(struct iov_iter *i,
 	switch (iov_iter_type(i)) {
 	case ITER_PIPE:
 		return pipe_get_pages_alloc(i, pages, maxsize, start);
+	case ITER_MAPPING:
+		return iter_mapping_get_pages_alloc(i, pages, maxsize, start);
+	case ITER_DISCARD:
 	case ITER_KVEC:
 		return -EFAULT;
 	case ITER_IOVEC:
@@ -1371,9 +1596,7 @@  ssize_t iov_iter_get_pages_alloc(struct iov_iter *i,
 			return -ENOMEM;
 		get_page(*p = v.bv_page);
 		return v.bv_len;
-	}),({
-		return -EFAULT;
-	})
+	}), 0, 0
 	)
 	return 0;
 }
@@ -1386,7 +1609,7 @@  size_t csum_and_copy_from_iter(void *addr, size_t bytes, __wsum *csum,
 	__wsum sum, next;
 	size_t off = 0;
 	sum = *csum;
-	if (unlikely(iov_iter_is_pipe(i))) {
+	if (unlikely(iov_iter_is_pipe(i) || iov_iter_type(i) == ITER_DISCARD)) {
 		WARN_ON(1);
 		return 0;
 	}
@@ -1414,6 +1637,14 @@  size_t csum_and_copy_from_iter(void *addr, size_t bytes, __wsum *csum,
 						 v.iov_len, 0);
 		sum = csum_block_add(sum, next, off);
 		off += v.iov_len;
+	}), ({
+		char *p = kmap_atomic(v.bv_page);
+		next = csum_partial_copy_nocheck(p + v.bv_offset,
+						 (to += v.bv_len) - v.bv_len,
+						 v.bv_len, 0);
+		kunmap_atomic(p);
+		sum = csum_block_add(sum, next, off);
+		off += v.bv_len;
 	})
 	)
 	*csum = sum;
@@ -1428,7 +1659,7 @@  bool csum_and_copy_from_iter_full(void *addr, size_t bytes, __wsum *csum,
 	__wsum sum, next;
 	size_t off = 0;
 	sum = *csum;
-	if (unlikely(iov_iter_is_pipe(i))) {
+	if (unlikely(iov_iter_is_pipe(i) || iov_iter_type(i) == ITER_DISCARD)) {
 		WARN_ON(1);
 		return false;
 	}
@@ -1458,6 +1689,14 @@  bool csum_and_copy_from_iter_full(void *addr, size_t bytes, __wsum *csum,
 						 v.iov_len, 0);
 		sum = csum_block_add(sum, next, off);
 		off += v.iov_len;
+	}), ({
+		char *p = kmap_atomic(v.bv_page);
+		next = csum_partial_copy_nocheck(p + v.bv_offset,
+						 (to += v.bv_len) - v.bv_len,
+						 v.bv_len, 0);
+		kunmap_atomic(p);
+		sum = csum_block_add(sum, next, off);
+		off += v.bv_len;
 	})
 	)
 	*csum = sum;
@@ -1473,7 +1712,7 @@  size_t csum_and_copy_to_iter(const void *addr, size_t bytes, __wsum *csum,
 	__wsum sum, next;
 	size_t off = 0;
 	sum = *csum;
-	if (unlikely(iov_iter_is_pipe(i))) {
+	if (unlikely(iov_iter_is_pipe(i) || iov_iter_type(i) == ITER_DISCARD)) {
 		WARN_ON(1);	/* for now */
 		return 0;
 	}
@@ -1501,6 +1740,14 @@  size_t csum_and_copy_to_iter(const void *addr, size_t bytes, __wsum *csum,
 						 v.iov_len, 0);
 		sum = csum_block_add(sum, next, off);
 		off += v.iov_len;
+	}), ({
+		char *p = kmap_atomic(v.bv_page);
+		next = csum_partial_copy_nocheck((from += v.bv_len) - v.bv_len,
+						 p + v.bv_offset,
+						 v.bv_len, 0);
+		kunmap_atomic(p);
+		sum = csum_block_add(sum, next, off);
+		off += v.bv_len;
 	})
 	)
 	*csum = sum;
@@ -1516,7 +1763,8 @@  int iov_iter_npages(const struct iov_iter *i, int maxpages)
 	if (!size)
 		return 0;
 
-	if (unlikely(iov_iter_is_pipe(i))) {
+	switch (iov_iter_type(i)) {
+	case ITER_PIPE: {
 		struct pipe_inode_info *pipe = i->pipe;
 		size_t off;
 		int idx;
@@ -1529,24 +1777,47 @@  int iov_iter_npages(const struct iov_iter *i, int maxpages)
 		npages = ((pipe->curbuf - idx - 1) & (pipe->buffers - 1)) + 1;
 		if (npages >= maxpages)
 			return maxpages;
-	} else iterate_all_kinds(i, size, v, ({
-		unsigned long p = (unsigned long)v.iov_base;
-		npages += DIV_ROUND_UP(p + v.iov_len, PAGE_SIZE)
-			- p / PAGE_SIZE;
-		if (npages >= maxpages)
-			return maxpages;
-	0;}),({
-		npages++;
-		if (npages >= maxpages)
-			return maxpages;
-	}),({
-		unsigned long p = (unsigned long)v.iov_base;
-		npages += DIV_ROUND_UP(p + v.iov_len, PAGE_SIZE)
-			- p / PAGE_SIZE;
+		break;
+	}
+	case ITER_MAPPING: {
+		unsigned offset;
+
+		offset = i->iov_offset & ~PAGE_MASK;
+
+		npages = 1;
+		if (size > PAGE_SIZE - offset) {
+			size -= PAGE_SIZE - offset;
+			npages += size >> PAGE_SHIFT;
+			size &= ~PAGE_MASK;
+			if (size)
+				npages++;
+		}
 		if (npages >= maxpages)
 			return maxpages;
-	})
-	)
+		break;
+	}
+	case ITER_DISCARD:
+		return 0;
+
+	default:
+		iterate_all_kinds(i, size, v, ({
+			unsigned long p = (unsigned long)v.iov_base;
+			npages += DIV_ROUND_UP(p + v.iov_len, PAGE_SIZE)
+				- p / PAGE_SIZE;
+			if (npages >= maxpages)
+				return maxpages;
+		0;}),({
+			npages++;
+			if (npages >= maxpages)
+				return maxpages;
+		}),({
+			unsigned long p = (unsigned long)v.iov_base;
+			npages += DIV_ROUND_UP(p + v.iov_len, PAGE_SIZE)
+				- p / PAGE_SIZE;
+			if (npages >= maxpages)
+				return maxpages;
+		}),
+		0
+		)
+	}
 	return npages;
 }
 EXPORT_SYMBOL(iov_iter_npages);
@@ -1567,6 +1838,9 @@  const void *dup_iter(struct iov_iter *new, struct iov_iter *old, gfp_t flags)
 		return new->iov = kmemdup(new->iov,
 				   new->nr_segs * sizeof(struct iovec),
 				   flags);
+	case ITER_MAPPING:
+	case ITER_DISCARD:
+		return NULL;
 	}
 
 	WARN_ON(1);
@@ -1670,7 +1944,12 @@  int iov_iter_for_each_range(struct iov_iter *i, size_t bytes,
 		kunmap(v.bv_page);
 		err;}), ({
 		w = v;
-		err = f(&w, context);})
+		err = f(&w, context);}), ({
+		w.iov_base = kmap(v.bv_page) + v.bv_offset;
+		w.iov_len = v.bv_len;
+		err = f(&w, context);
+		kunmap(v.bv_page);
+		err;})
 	)
 	return err;
 }