[4/5] block: Add support for bouncing pinned pages

Message ID 20230209123206.3548-4-jack@suse.cz (mailing list archive)
State New
Series Writeback handling of pinned pages

Commit Message

Jan Kara Feb. 9, 2023, 12:31 p.m. UTC
When there is a direct IO (or other DMA write) running into a page, it is
not generally safe to submit this page for another IO (such as
writeback) because this can cause checksum failures or similar issues.
However, sometimes we cannot avoid writing the contents of these pages as
pages can stay pinned for extended periods of time (e.g. for RDMA). For
these cases we need to bounce the pages if we really have to write
them out. Add support for this type of bouncing to the block layer
infrastructure.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 block/blk.h               | 10 +++++++++-
 block/bounce.c            |  9 +++++++--
 include/linux/blk_types.h |  1 +
 mm/Kconfig                |  8 ++++----
 4 files changed, 21 insertions(+), 7 deletions(-)

Comments

Christoph Hellwig Feb. 13, 2023, 9:59 a.m. UTC | #1
Eww.  The block bounce code really needs to go away, so a new user
makes me very unhappy.

But independent of that I don't think this is enough anyway.  Just
copying the data out into a new page in the block layer doesn't solve
the problem that this page needs to be tracked as dirtied for fs
accounting.  e.g. every time we write this copy it needs space allocated
for COW file systems.

Which brings me back to if and when we do writeback for pinned pages.
I don't think doing any I/O for short term pins like direct I/O
makes sense.  These pins are defined to be unpinned after the I/O
completes, so we might as well just wait for the unpin instead of doing
anything complicated.

Long term pins are more troublesome, but I really wonder what the
defined semantics for data integrity writeback like fsync on them
are to start with, as the content is very much undefined.  Should
an fsync on a (partially) long term pinned file simply fail?  It's
not like we can win in that scenario.
Jan Kara Feb. 14, 2023, 1:56 p.m. UTC | #2
On Mon 13-02-23 01:59:28, Christoph Hellwig wrote:
> Eww.  The block bounce code really needs to go away, so a new user
> makes me very unhappy.
> 
> But independent of that I don't think this is enough anyway.  Just
> copying the data out into a new page in the block layer doesn't solve
> the problem that this page needs to be tracked as dirtied for fs
> accounting.  e.g. every time we write this copy it needs space allocated
> for COW file systems.

Right, I forgot about this in my RFC. My original plan was to not clear the
dirty bit in clear_page_dirty_for_io() even for WB_SYNC_ALL writeback when
we do writeback the page and perhaps indicate this in the return value of
clear_page_dirty_for_io() so that the COW filesystem can keep tracking this
page as dirty.
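
Roughly something like the sketch below (untested, for illustration only;
the WB_FOLIO_* values and the _sketch helper are made up here, they are not
existing kernel API):

#include <linux/mm.h>
#include <linux/pagemap.h>

/*
 * Sketch: keep the dirty bit on DMA-pinned folios so that data integrity
 * writeback still writes them out, while telling the caller (and thus the
 * filesystem) that the folio must keep being tracked as dirty.
 */
enum wb_dirty_state {
	WB_FOLIO_CLEAN,		/* not dirty, nothing to write */
	WB_FOLIO_CLEARED,	/* dirty bit cleared, normal writeback */
	WB_FOLIO_KEPT_DIRTY,	/* pinned: write it out but keep it dirty */
};

static enum wb_dirty_state folio_clear_dirty_for_io_sketch(struct folio *folio)
{
	if (folio_maybe_dma_pinned(folio)) {
		/* Do not clear the dirty bit, the caller still writes the folio. */
		return folio_test_dirty(folio) ? WB_FOLIO_KEPT_DIRTY
					       : WB_FOLIO_CLEAN;
	}
	return folio_clear_dirty_for_io(folio) ? WB_FOLIO_CLEARED
					       : WB_FOLIO_CLEAN;
}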

> Which brings me back to if and when we do writeback for pinned pages.
> I don't think doing any I/O for short term pins like direct I/O
> makes sense.  These pins are defined to be unpinned after the I/O
> completes, so we might as well just wait for the unpin instead of doing
> anything complicated.

Agreed. For short term pins we could just wait, which should be quite
simple (although this behavior has some DoS potential if somebody
runs multiple processes that keep pinning some page with short term pins).
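
(For illustration only: since there is currently no wakeup tied to
unpinning, the simplest form of such waiting would be plain polling,
roughly as in the sketch below, which is also where the DoS worry comes
from.)

#include <linux/delay.h>
#include <linux/mm.h>

/*
 * Sketch only: before writing a folio back, wait for short term pins
 * (e.g. in-flight direct IO) to be dropped. There is no wait queue for
 * unpinning today, so this simply polls; a real version would want a
 * proper wakeup and a bound on the wait.
 */
static void wait_for_folio_unpin_sketch(struct folio *folio)
{
	while (folio_maybe_dma_pinned(folio))
		msleep(1);
}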

> Long term pins are more troublesome, but I really wonder what the
> defined semantics for data integrity writeback like fsync on them
> are to start with, as the content is very much undefined.  Should
> an fsync on a (partially) long term pinned file simply fail?  It's
> not like we can win in that scenario.

Well, we also have cases like sync(2), so one would have to be careful with
error propagation, and I'm afraid there are enough programs out there that
treat any error return from fsync(2) as catastrophic that I suspect this
could lead to some surprises. The case I'm most worried about is an
application that sets up RDMA to an mmapped file, runs the transfer and
waits for it to complete, doesn't bother to unpin the pages (keeps them for
future transfers), and calls fsync(2) to make the data stable on local
storage. That does seem like quite a sensible use and so far it works just
fine. And not writing the pages on fsync(2) would break such uses.

								Honza
Dave Chinner Feb. 15, 2023, 4:59 a.m. UTC | #3
On Tue, Feb 14, 2023 at 02:56:04PM +0100, Jan Kara wrote:
> On Mon 13-02-23 01:59:28, Christoph Hellwig wrote:
> > Eww.  The block bounce code really needs to go away, so a new user
> > makes me very unhappy.
> > 
> > But independent of that I don't think this is enough anyway.  Just
> > copying the data out into a new page in the block layer doesn't solve
> > the problem that this page needs to be tracked as dirtied for fs
> > accounting.  e.g. every time we write this copy it needs space allocated
> > for COW file systems.
> 
> Right, I forgot about this in my RFC. My original plan was to not clear the
> dirty bit in clear_page_dirty_for_io() even for WB_SYNC_ALL writeback when
> we do writeback the page and perhaps indicate this in the return value of
> clear_page_dirty_for_io() so that the COW filesystem can keep tracking this
> page as dirty.

I don't think this works, especially if the COW mechanism relies on
delayed allocation to prevent ENOSPC during writeback. That is, we
need a write() or page fault (to run ->page_mkwrite()) after every
call to folio_clear_dirty_for_io() in the writeback path to ensure
that new space is reserved for the allocation that will occur
during a future writeback of that page.

Hence we can't just leave the page dirty on COW filesystems - it has
to go through a clean state so that the clean->dirty event can be
gated on gaining the space reservation that allows it to be written
back again.
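
(For illustration, a rough sketch of that clean->dirty gate as a generic
->page_mkwrite handler; fs_reserve_blocks() is a made-up stand-in for the
filesystem's real delayed allocation reservation code.)

#include <linux/fs.h>
#include <linux/mm.h>
#include <linux/pagemap.h>

/*
 * Sketch of the usual clean->dirty gate on a delalloc/COW filesystem:
 * the write fault reserves space before the folio is redirtied, which
 * is exactly the step that never runs if the folio never goes clean.
 * fs_reserve_blocks() is a placeholder, not a real kernel function.
 */
static vm_fault_t fs_page_mkwrite_sketch(struct vm_fault *vmf)
{
	struct folio *folio = page_folio(vmf->page);
	struct inode *inode = file_inode(vmf->vma->vm_file);
	int err;

	folio_lock(folio);
	err = fs_reserve_blocks(inode, folio_pos(folio), folio_size(folio));
	if (err) {
		folio_unlock(folio);
		return vmf_error(err);
	}
	/* clean->dirty transition, now backed by a space reservation */
	folio_mark_dirty(folio);
	return VM_FAULT_LOCKED;
}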

Cheers,

Dave.
Christoph Hellwig Feb. 15, 2023, 6:24 a.m. UTC | #4
On Wed, Feb 15, 2023 at 03:59:52PM +1100, Dave Chinner wrote:
> I don't think this works, especially if the COW mechanism relies on
> delayed allocation to prevent ENOSPC during writeback. That is, we
> need a write() or page fault (to run ->page_mkwrite()) after every
> call to folio_clear_dirty_for_io() in the writeback path to ensure
> that new space is reserved for the allocation that will occur
> during a future writeback of that page.
> 
> Hence we can't just leave the page dirty on COW filesystems - it has
> to go through a clean state so that the clean->dirty event can be
> gated on gaining the space reservation that allows it to be written
> back again.

Exactly.  Although if we really want we could do the redirtying without
formally moving to a clean state, but it certainly would require special
new code to do the same steps as if we were redirtying.

Which is another reason why I'd prefer to avoid all that if we can.
For that we probably need an inventory of what long term pins we have
in the kernel tree that can and do operate on shared file mappings,
and what kind of I/O semantics they expect.
Jan Kara Feb. 16, 2023, 12:33 p.m. UTC | #5
On Tue 14-02-23 22:24:33, Christoph Hellwig wrote:
> On Wed, Feb 15, 2023 at 03:59:52PM +1100, Dave Chinner wrote:
> > I don't think this works, especially if the COW mechanism relies on
> > delayed allocation to prevent ENOSPC during writeback. That is, we
> > need a write() or page fault (to run ->page_mkwrite()) after every
> > call to folio_clear_dirty_for_io() in the writeback path to ensure
> > that new space is reserved for the allocation that will occur
> > during a future writeback of that page.
> > 
> > Hence we can't just leave the page dirty on COW filesystems - it has
> > to go through a clean state so that the clean->dirty event can be
> > gated on gaining the space reservation that allows it to be written
> > back again.
> 
> Exactly.  Although if we really want we could do the redirtying without
> formally moving to a clean state, but it certainly would require special
> new code to do the same steps as if we were redirtying.

Yes.

> Which is another reason why I'd prefer to avoid all that if we can.
> For that we probably need an inventory of what long term pins we have
> in the kernel tree that can and do operate on shared file mappings,
> and what kind of I/O semantics they expect.

I'm a bit skeptical we can reasonably assess that (as much as I would love
to just not write these pages and be done with it) because a lot of
FOLL_LONGTERM users just pin a passed userspace address range, then allow
userspace to manipulate it with other operations, and finally unpin it with
another call. Who knows whether shared pagecache pages are passed in and
what userspace is doing with them while they are pinned? 

We have stuff like io_uring using FOLL_LONGTERM for IO buffers passed from
userspace (e.g. the IORING_REGISTER_BUFFERS operation), we have V4L2 which
similarly pins buffers for video processing (and I vaguely remember one
bug report due to some phone passing shared file pages there to acquire
screenshots from a webcam), and we have various infiniband drivers doing
this (not all of them are using FOLL_LONGTERM but they should AFAICS). We
even have vmsplice(2), which arguably should be using pinning with
FOLL_LONGTERM (at least that's the plan AFAIK), and not writing such pages
would IMO provide an interesting attack vector...

								Honza
Christoph Hellwig Feb. 20, 2023, 6:22 a.m. UTC | #6
On Thu, Feb 16, 2023 at 01:33:16PM +0100, Jan Kara wrote:
> I'm a bit skeptical we can reasonably assess that (as much as I would love
> to just not write these pages and be done with it) because a lot of
> FOLL_LONGTERM users just pin a passed userspace address range, then allow
> userspace to manipulate it with other operations, and finally unpin it with
> another call. Who knows whether shared pagecache pages are passed in and
> what userspace is doing with them while they are pinned? 

True.

So what other sensible thing could we do at a higher level?

Treat MAP_SHARED buffers that are long term registered as MAP_PRIVATE
while registered, and just do writeback using in-kernel O_DIRECT on
fsync?
Jan Kara Feb. 27, 2023, 11:39 a.m. UTC | #7
On Sun 19-02-23 22:22:32, Christoph Hellwig wrote:
> On Thu, Feb 16, 2023 at 01:33:16PM +0100, Jan Kara wrote:
> > I'm a bit skeptical we can reasonably assess that (as much as I would love
> > to just not write these pages and be done with it) because a lot of
> > FOLL_LONGTERM users just pin a passed userspace address range, then allow
> > userspace to manipulate it with other operations, and finally unpin it with
> > another call. Who knows whether shared pagecache pages are passed in and
> > what userspace is doing with them while they are pinned? 
> 
> True.
> 
> So what other sensible thing could we do at a higher level?
> 
> Treat MAP_SHARED buffers that are long term registered as MAP_PRIVATE
> while registered, and just do writeback using in-kernel O_DIRECT on
> fsync?

Do you mean to copy these pages on fsync(2) to a newly allocated buffer and
then submit it via direct IO? That looks sensible to me. We could then make
the writeback path just completely ignore these long term pinned pages and just
add this copying logic into filemap_fdatawrite() or something like that.
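
(Very roughly, something like the untested sketch below, just to illustrate
the copy step; submit_copy_dio() is a made-up placeholder for the actual
direct IO submission and a single-page folio is assumed.)

#include <linux/fs.h>
#include <linux/gfp.h>
#include <linux/highmem.h>
#include <linux/pagemap.h>

/*
 * Sketch only: on fsync, copy a long term pinned pagecache folio into a
 * freshly allocated page and write the copy out via in-kernel direct IO.
 * submit_copy_dio() is a placeholder, and a single-page folio is assumed.
 */
static int fsync_copy_pinned_folio_sketch(struct file *file, struct folio *folio)
{
	struct page *copy = alloc_page(GFP_KERNEL);
	int err;

	if (!copy)
		return -ENOMEM;
	copy_highpage(copy, folio_page(folio, 0));
	err = submit_copy_dio(file, copy, folio_pos(folio), folio_size(folio));
	__free_page(copy);
	return err;
}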

								Honza
Christoph Hellwig Feb. 27, 2023, 1:36 p.m. UTC | #8
On Mon, Feb 27, 2023 at 12:39:26PM +0100, Jan Kara wrote:
> Do you mean to copy these pages on fsync(2) to a newly allocated buffer and
> then submit it via direct IO? That looks sensible to me. We could then make
> the writeback path just completely ignore these long term pinned pages and just
> add this copying logic into filemap_fdatawrite() or something like that.

I don't think we'd even have to copy them on fsync.  Just do an in-kernel
ITER_BVEC direct I/O on them.  The only hard part would be to come up
with a way to skip the pagecache invalidation and writeout for these
I/Os.
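
(A minimal, untested sketch of such an in-kernel ITER_BVEC direct write on a
single pagecache folio; the hard part, skipping the pagecache invalidation
and writeout that the IOCB_DIRECT path normally does, is exactly what is not
shown here.)

#include <linux/bvec.h>
#include <linux/fs.h>
#include <linux/uio.h>

/*
 * Sketch only: write one pagecache folio using an in-kernel ITER_BVEC
 * direct IO. Skipping the pagecache invalidation and writeout that the
 * IOCB_DIRECT write path normally performs is not handled here.
 */
static ssize_t dio_write_folio_sketch(struct file *file, struct folio *folio)
{
	struct bio_vec bvec = {
		.bv_page	= folio_page(folio, 0),
		.bv_len		= folio_size(folio),
		.bv_offset	= 0,
	};
	struct iov_iter iter;
	struct kiocb kiocb;

	iov_iter_bvec(&iter, ITER_SOURCE, &bvec, 1, folio_size(folio));
	init_sync_kiocb(&kiocb, file);
	kiocb.ki_pos = folio_pos(folio);
	kiocb.ki_flags |= IOCB_DIRECT;

	return call_write_iter(file, &kiocb, &iter);
}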

Patch

diff --git a/block/blk.h b/block/blk.h
index 4c3b3325219a..def7ab8379bc 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -384,10 +384,18 @@  static inline bool blk_queue_may_bounce(struct request_queue *q)
 		max_low_pfn >= max_pfn;
 }
 
+static inline bool bio_need_pin_bounce(struct bio *bio,
+		struct request_queue *q)
+{
+	return IS_ENABLED(CONFIG_BOUNCE) &&
+		bio->bi_flags & (1 << BIO_NEED_PIN_BOUNCE);
+}
+
 static inline struct bio *blk_queue_bounce(struct bio *bio,
 		struct request_queue *q)
 {
-	if (unlikely(blk_queue_may_bounce(q) && bio_has_data(bio)))
+	if (unlikely((blk_queue_may_bounce(q) || bio_need_pin_bounce(bio, q)) &&
+		     bio_has_data(bio)))
 		return __blk_queue_bounce(bio, q);
 	return bio;
 }
diff --git a/block/bounce.c b/block/bounce.c
index 7cfcb242f9a1..ebda95953d58 100644
--- a/block/bounce.c
+++ b/block/bounce.c
@@ -207,12 +207,16 @@  struct bio *__blk_queue_bounce(struct bio *bio_orig, struct request_queue *q)
 	struct bvec_iter iter;
 	unsigned i = 0, bytes = 0;
 	bool bounce = false;
+	bool pinned_bounce = bio_orig->bi_flags & (1 << BIO_NEED_PIN_BOUNCE);
+	bool highmem_bounce = blk_queue_may_bounce(q);
 	int sectors;
 
 	bio_for_each_segment(from, bio_orig, iter) {
 		if (i++ < BIO_MAX_VECS)
 			bytes += from.bv_len;
-		if (PageHighMem(from.bv_page))
+		if (highmem_bounce && PageHighMem(from.bv_page))
+			bounce = true;
+		if (pinned_bounce && page_maybe_dma_pinned(from.bv_page))
 			bounce = true;
 	}
 	if (!bounce)
@@ -241,7 +245,8 @@  struct bio *__blk_queue_bounce(struct bio *bio_orig, struct request_queue *q)
 	for (i = 0, to = bio->bi_io_vec; i < bio->bi_vcnt; to++, i++) {
 		struct page *bounce_page;
 
-		if (!PageHighMem(to->bv_page))
+		if (!((highmem_bounce && PageHighMem(to->bv_page)) ||
+		      (pinned_bounce && page_maybe_dma_pinned(to->bv_page))))
 			continue;
 
 		bounce_page = mempool_alloc(&page_pool, GFP_NOIO);
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 99be590f952f..3aa1dc5d8dc6 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -321,6 +321,7 @@  enum {
 	BIO_NO_PAGE_REF,	/* don't put release vec pages */
 	BIO_CLONED,		/* doesn't own data */
 	BIO_BOUNCED,		/* bio is a bounce bio */
+	BIO_NEED_PIN_BOUNCE,	/* bio needs to bounce pinned pages */
 	BIO_QUIET,		/* Make BIO Quiet */
 	BIO_CHAIN,		/* chained bio, ->bi_remaining in effect */
 	BIO_REFFED,		/* bio has elevated ->bi_cnt */
diff --git a/mm/Kconfig b/mm/Kconfig
index ff7b209dec05..eba075e959e8 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -659,11 +659,11 @@  config PHYS_ADDR_T_64BIT
 config BOUNCE
 	bool "Enable bounce buffers"
 	default y
-	depends on BLOCK && MMU && HIGHMEM
+	depends on BLOCK && MMU
 	help
-	  Enable bounce buffers for devices that cannot access the full range of
-	  memory available to the CPU. Enabled by default when HIGHMEM is
-	  selected, but you may say n to override this.
+	  Enable bounce buffers. This is used for devices that cannot access
+	  the full range of memory available to the CPU or when DMA can be
+	  modifying pages while they are submitted for writeback.
 
 config MMU_NOTIFIER
 	bool