[3/5] mm: make buffered writes work with RWF_UNCACHED
diff mbox series

Message ID 20191210162454.8608-4-axboe@kernel.dk
State New
Headers show
Series
  • Support for RWF_UNCACHED
Related show

Commit Message

Jens Axboe Dec. 10, 2019, 4:24 p.m. UTC
If RWF_UNCACHED is set for io_uring (or pwritev2(2)), we'll drop the
cache instantiated for buffered writes. If new pages aren't
instantiated, we leave them alone. This provides similar semantics to
reads with RWF_UNCACHED set.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 include/linux/fs.h |  3 ++
 mm/filemap.c       | 78 +++++++++++++++++++++++++++++++++++++++++++---
 2 files changed, 76 insertions(+), 5 deletions(-)

Comments

Matthew Wilcox Dec. 10, 2019, 4:55 p.m. UTC | #1
On Tue, Dec 10, 2019 at 09:24:52AM -0700, Jens Axboe wrote:
> +/*
> + * Start writeback on the pages in pgs[], and then try and remove those pages
> + * from the page cached. Used with RWF_UNCACHED.
> + */
> +void write_drop_cached_pages(struct page **pgs, struct address_space *mapping,
> +			     unsigned *nr)

It would seem more natural to use a pagevec instead of pgs/nr.

> +{
> +	loff_t start, end;
> +	int i;
> +
> +	end = 0;
> +	start = LLONG_MAX;
> +	for (i = 0; i < *nr; i++) {
> +		struct page *page = pgs[i];
> +		loff_t off;
> +
> +		off = (loff_t) page_to_index(page) << PAGE_SHIFT;

Isn't that page_offset()?

> +	__filemap_fdatawrite_range(mapping, start, end, WB_SYNC_NONE);
> +
> +	for (i = 0; i < *nr; i++) {
> +		struct page *page = pgs[i];
> +
> +		lock_page(page);
> +		if (page->mapping == mapping) {

So you're protecting against the page being freed and reallocated to a
different file, but not against the page being freed and reallocated to
a location in the same file which is outside (start, end)?
Jens Axboe Dec. 10, 2019, 5:02 p.m. UTC | #2
On 12/10/19 9:55 AM, Matthew Wilcox wrote:
> On Tue, Dec 10, 2019 at 09:24:52AM -0700, Jens Axboe wrote:
>> +/*
>> + * Start writeback on the pages in pgs[], and then try and remove those pages
>> + * from the page cached. Used with RWF_UNCACHED.
>> + */
>> +void write_drop_cached_pages(struct page **pgs, struct address_space *mapping,
>> +			     unsigned *nr)
> 
> It would seem more natural to use a pagevec instead of pgs/nr.

I did look into that, but they are intertwined with LRU etc. I
deliberately avoided the LRU on the read side, as it adds noticeable
overhead and gains us nothing since the pages will be dropped agian.

>> +{
>> +	loff_t start, end;
>> +	int i;
>> +
>> +	end = 0;
>> +	start = LLONG_MAX;
>> +	for (i = 0; i < *nr; i++) {
>> +		struct page *page = pgs[i];
>> +		loff_t off;
>> +
>> +		off = (loff_t) page_to_index(page) << PAGE_SHIFT;
> 
> Isn't that page_offset()?

I guess it is! I'll make that change.

>> +	__filemap_fdatawrite_range(mapping, start, end, WB_SYNC_NONE);
>> +
>> +	for (i = 0; i < *nr; i++) {
>> +		struct page *page = pgs[i];
>> +
>> +		lock_page(page);
>> +		if (page->mapping == mapping) {
> 
> So you're protecting against the page being freed and reallocated to a
> different file, but not against the page being freed and reallocated
> to a location in the same file which is outside (start, end)?

I guess so, we can add that too, probably just check if the index is
still the same. More of a behavioral thing, shouldn't be any
correctness issues there.
Chris Mason Dec. 10, 2019, 6:35 p.m. UTC | #3
On 10 Dec 2019, at 12:02, Jens Axboe wrote:

> On 12/10/19 9:55 AM, Matthew Wilcox wrote:
>> On Tue, Dec 10, 2019 at 09:24:52AM -0700, Jens Axboe wrote:
>>> +/*
>>> + * Start writeback on the pages in pgs[], and then try and remove 
>>> those pages
>>> + * from the page cached. Used with RWF_UNCACHED.
>>> + */
>>> +void write_drop_cached_pages(struct page **pgs, struct 
>>> address_space *mapping,
>>> +			     unsigned *nr)
>>
>> It would seem more natural to use a pagevec instead of pgs/nr.
>
> I did look into that, but they are intertwined with LRU etc. I
> deliberately avoided the LRU on the read side, as it adds noticeable
> overhead and gains us nothing since the pages will be dropped agian.
>
>>> +{
>>> +	loff_t start, end;
>>> +	int i;
>>> +
>>> +	end = 0;
>>> +	start = LLONG_MAX;
>>> +	for (i = 0; i < *nr; i++) {
>>> +		struct page *page = pgs[i];
>>> +		loff_t off;
>>> +
>>> +		off = (loff_t) page_to_index(page) << PAGE_SHIFT;
>>
>> Isn't that page_offset()?
>
> I guess it is! I'll make that change.
>
>>> +	__filemap_fdatawrite_range(mapping, start, end, WB_SYNC_NONE);
>>> +
>>> +	for (i = 0; i < *nr; i++) {
>>> +		struct page *page = pgs[i];
>>> +
>>> +		lock_page(page);
>>> +		if (page->mapping == mapping) {
>>
>> So you're protecting against the page being freed and reallocated to 
>> a
>> different file, but not against the page being freed and reallocated
>> to a location in the same file which is outside (start, end)?
>
> I guess so, we can add that too, probably just check if the index is
> still the same. More of a behavioral thing, shouldn't be any
> correctness issues there.

Since we have a reference on the page, the mapping can go to NULL but 
otherwise it should stay in the same mapping at the same offset.

But, Jens and I both just realized he needs to take the reference on the 
page before write_end is called.

-chris
Matthew Wilcox Dec. 10, 2019, 6:58 p.m. UTC | #4
On Tue, Dec 10, 2019 at 10:02:18AM -0700, Jens Axboe wrote:
> On 12/10/19 9:55 AM, Matthew Wilcox wrote:
> > On Tue, Dec 10, 2019 at 09:24:52AM -0700, Jens Axboe wrote:
> >> +/*
> >> + * Start writeback on the pages in pgs[], and then try and remove those pages
> >> + * from the page cached. Used with RWF_UNCACHED.
> >> + */
> >> +void write_drop_cached_pages(struct page **pgs, struct address_space *mapping,
> >> +			     unsigned *nr)
> > 
> > It would seem more natural to use a pagevec instead of pgs/nr.
> 
> I did look into that, but they are intertwined with LRU etc. I
> deliberately avoided the LRU on the read side, as it adds noticeable
> overhead and gains us nothing since the pages will be dropped agian.

I agree the LRU uses them, but they're used in all kinds of places where
we need to batch pages, eg truncate, munlock.
Jens Axboe Dec. 10, 2019, 7:10 p.m. UTC | #5
On 12/10/19 11:58 AM, Matthew Wilcox wrote:
> On Tue, Dec 10, 2019 at 10:02:18AM -0700, Jens Axboe wrote:
>> On 12/10/19 9:55 AM, Matthew Wilcox wrote:
>>> On Tue, Dec 10, 2019 at 09:24:52AM -0700, Jens Axboe wrote:
>>>> +/*
>>>> + * Start writeback on the pages in pgs[], and then try and remove those pages
>>>> + * from the page cached. Used with RWF_UNCACHED.
>>>> + */
>>>> +void write_drop_cached_pages(struct page **pgs, struct address_space *mapping,
>>>> +			     unsigned *nr)
>>>
>>> It would seem more natural to use a pagevec instead of pgs/nr.
>>
>> I did look into that, but they are intertwined with LRU etc. I
>> deliberately avoided the LRU on the read side, as it adds noticeable
>> overhead and gains us nothing since the pages will be dropped agian.
> 
> I agree the LRU uses them, but they're used in all kinds of places where
> we need to batch pages, eg truncate, munlock.

I think I just had to convince myself that the release business works
the way it should for my use case, and I think it does. I'll test it.
release_pages() isn't exactly the prettiest thing out there.
Dave Chinner Dec. 11, 2019, 12:23 a.m. UTC | #6
On Tue, Dec 10, 2019 at 09:24:52AM -0700, Jens Axboe wrote:
> If RWF_UNCACHED is set for io_uring (or pwritev2(2)), we'll drop the
> cache instantiated for buffered writes. If new pages aren't
> instantiated, we leave them alone. This provides similar semantics to
> reads with RWF_UNCACHED set.

So what about filesystems that don't use generic_perform_write()?
i.e. Anything that uses the iomap infrastructure (i.e.
iomap_file_buffered_write()) instead of generic_file_write_iter())
will currently ignore RWF_UNCACHED. That's XFS and gfs2 right now,
but there are likely to be more in the near future as more
filesystems are ported to the iomap infrastructure.

I'd also really like to see extensive fsx and fstress testing of
this new IO mode before it is committed - this is going to exercise page
cache coherency across different operations in new and unique
ways. that means we need patches to fstests to detect and use this
functionality when available, and new tests that explicitly exercise
combinations of buffered, mmap, dio and uncached for a range of
different IO size and alignments (e.g. mixing sector sized uncached
IO with page sized buffered/mmap/dio and vice versa).

We are not going to have a repeat of the copy_file_range() data
corruption fuckups because no testing was done and no test
infrastructure was written before the new API was committed.

> +void write_drop_cached_pages(struct page **pgs, struct address_space *mapping,
> +			     unsigned *nr)
> +{
> +	loff_t start, end;
> +	int i;
> +
> +	end = 0;
> +	start = LLONG_MAX;
> +	for (i = 0; i < *nr; i++) {
> +		struct page *page = pgs[i];
> +		loff_t off;
> +
> +		off = (loff_t) page_to_index(page) << PAGE_SHIFT;
> +		if (off < start)
> +			start = off;
> +		if (off > end)
> +			end = off;
> +		get_page(page);
> +	}
> +
> +	__filemap_fdatawrite_range(mapping, start, end, WB_SYNC_NONE);
> +
> +	for (i = 0; i < *nr; i++) {
> +		struct page *page = pgs[i];
> +
> +		lock_page(page);
> +		if (page->mapping == mapping) {
> +			wait_on_page_writeback(page);
> +			if (!page_has_private(page) ||
> +			    try_to_release_page(page, 0))
> +				remove_mapping(mapping, page);
> +		}
> +		unlock_page(page);
> +	}
> +	*nr = 0;
> +}
> +EXPORT_SYMBOL_GPL(write_drop_cached_pages);
> +
> +#define GPW_PAGE_BATCH		16

In terms of performance, file fragmentation and premature filesystem
aging, this is also going to suck *really badly* for filesystems
that use delayed allocation because it is going to force conversion
of delayed allocation extents during the write() call. IOWs,
it adds all the overheads of doing delayed allocation, but it reaps
none of the benefits because it doesn't allow large contiguous
extents to build up in memory before physical allocation occurs.
i.e. there is no "delayed" in this allocation....

So it might work fine on a pristine, empty filesystem where it is
easy to find contiguous free space accross multiple allocations, but
it's going to suck after a few months of production usage has
fragmented all the free space into tiny pieces...

Cheers,

Dave.
Dave Chinner Dec. 11, 2019, 12:28 a.m. UTC | #7
On Wed, Dec 11, 2019 at 11:23:49AM +1100, Dave Chinner wrote:
> On Tue, Dec 10, 2019 at 09:24:52AM -0700, Jens Axboe wrote:
> > If RWF_UNCACHED is set for io_uring (or pwritev2(2)), we'll drop the
> > cache instantiated for buffered writes. If new pages aren't
> > instantiated, we leave them alone. This provides similar semantics to
> > reads with RWF_UNCACHED set.
> 
> So what about filesystems that don't use generic_perform_write()?
> i.e. Anything that uses the iomap infrastructure (i.e.
> iomap_file_buffered_write()) instead of generic_file_write_iter())
> will currently ignore RWF_UNCACHED. That's XFS and gfs2 right now,
> but there are likely to be more in the near future as more
> filesystems are ported to the iomap infrastructure.

Hmmm - I'm missing part of this patchset, and I see a second version
has been posted that has iomap stuff in it. I'll go look at that
now...

Cheers,

Dave.
Jens Axboe Dec. 11, 2019, 2:39 p.m. UTC | #8
On 12/10/19 5:23 PM, Dave Chinner wrote:
> On Tue, Dec 10, 2019 at 09:24:52AM -0700, Jens Axboe wrote:
>> If RWF_UNCACHED is set for io_uring (or pwritev2(2)), we'll drop the
>> cache instantiated for buffered writes. If new pages aren't
>> instantiated, we leave them alone. This provides similar semantics to
>> reads with RWF_UNCACHED set.
> 
> So what about filesystems that don't use generic_perform_write()?
> i.e. Anything that uses the iomap infrastructure (i.e.
> iomap_file_buffered_write()) instead of generic_file_write_iter())
> will currently ignore RWF_UNCACHED. That's XFS and gfs2 right now,
> but there are likely to be more in the near future as more
> filesystems are ported to the iomap infrastructure.

I'll skip this one as you found it.

> I'd also really like to see extensive fsx and fstress testing of
> this new IO mode before it is committed - this is going to exercise page
> cache coherency across different operations in new and unique
> ways. that means we need patches to fstests to detect and use this
> functionality when available, and new tests that explicitly exercise
> combinations of buffered, mmap, dio and uncached for a range of
> different IO size and alignments (e.g. mixing sector sized uncached
> IO with page sized buffered/mmap/dio and vice versa).
> 
> We are not going to have a repeat of the copy_file_range() data
> corruption fuckups because no testing was done and no test
> infrastructure was written before the new API was committed.

Oh I totally agree, and there's no push from my end on this. I just
think it's a cool feature and could be very useful, but it obviously
needs a healthy dose of testing and test cases written. I'll be doing
that as well.

>> +void write_drop_cached_pages(struct page **pgs, struct address_space *mapping,
>> +			     unsigned *nr)
>> +{
>> +	loff_t start, end;
>> +	int i;
>> +
>> +	end = 0;
>> +	start = LLONG_MAX;
>> +	for (i = 0; i < *nr; i++) {
>> +		struct page *page = pgs[i];
>> +		loff_t off;
>> +
>> +		off = (loff_t) page_to_index(page) << PAGE_SHIFT;
>> +		if (off < start)
>> +			start = off;
>> +		if (off > end)
>> +			end = off;
>> +		get_page(page);
>> +	}
>> +
>> +	__filemap_fdatawrite_range(mapping, start, end, WB_SYNC_NONE);
>> +
>> +	for (i = 0; i < *nr; i++) {
>> +		struct page *page = pgs[i];
>> +
>> +		lock_page(page);
>> +		if (page->mapping == mapping) {
>> +			wait_on_page_writeback(page);
>> +			if (!page_has_private(page) ||
>> +			    try_to_release_page(page, 0))
>> +				remove_mapping(mapping, page);
>> +		}
>> +		unlock_page(page);
>> +	}
>> +	*nr = 0;
>> +}
>> +EXPORT_SYMBOL_GPL(write_drop_cached_pages);
>> +
>> +#define GPW_PAGE_BATCH		16
> 
> In terms of performance, file fragmentation and premature filesystem
> aging, this is also going to suck *really badly* for filesystems
> that use delayed allocation because it is going to force conversion
> of delayed allocation extents during the write() call. IOWs,
> it adds all the overheads of doing delayed allocation, but it reaps
> none of the benefits because it doesn't allow large contiguous
> extents to build up in memory before physical allocation occurs.
> i.e. there is no "delayed" in this allocation....
> 
> So it might work fine on a pristine, empty filesystem where it is
> easy to find contiguous free space accross multiple allocations, but
> it's going to suck after a few months of production usage has
> fragmented all the free space into tiny pieces...

I totally agree on this one, and I'm not a huge fan of it. But
considering your suggestion in the other email, I think we just need to
move this up a notch and do it per-write instead. If we can pass back
information about the state of the page cache for the range we care
about, then there's no reason to do it per-page for the write case.
Reads are still best done that way, and we can avoid the LRU overhead by
doing it that way.

Patch
diff mbox series

diff --git a/include/linux/fs.h b/include/linux/fs.h
index bf58db1bc032..bcf486c132a8 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -285,6 +285,7 @@  enum positive_aop_returns {
 #define AOP_FLAG_NOFS			0x0002 /* used by filesystem to direct
 						* helper code (eg buffer layer)
 						* to clear GFP_FS from alloc */
+#define AOP_FLAG_UNCACHED		0x0004
 
 /*
  * oh the beauties of C type declarations.
@@ -3105,6 +3106,8 @@  extern ssize_t generic_file_write_iter(struct kiocb *, struct iov_iter *);
 extern ssize_t generic_file_direct_write(struct kiocb *, struct iov_iter *);
 extern ssize_t generic_perform_write(struct file *, struct iov_iter *,
 				     struct kiocb *);
+extern void write_drop_cached_pages(struct page **,
+				    struct address_space *mapping, unsigned *);
 
 ssize_t vfs_iter_read(struct file *file, struct iov_iter *iter, loff_t *ppos,
 		rwf_t flags);
diff --git a/mm/filemap.c b/mm/filemap.c
index fe37bd2b2630..d6171bf705f9 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3287,10 +3287,12 @@  struct page *grab_cache_page_write_begin(struct address_space *mapping,
 					pgoff_t index, unsigned flags)
 {
 	struct page *page;
-	int fgp_flags = FGP_LOCK|FGP_WRITE|FGP_CREAT;
+	int fgp_flags = FGP_LOCK|FGP_WRITE;
 
 	if (flags & AOP_FLAG_NOFS)
 		fgp_flags |= FGP_NOFS;
+	if (!(flags & AOP_FLAG_UNCACHED))
+		fgp_flags |= FGP_CREAT;
 
 	page = pagecache_get_page(mapping, index, fgp_flags,
 			mapping_gfp_mask(mapping));
@@ -3301,21 +3303,67 @@  struct page *grab_cache_page_write_begin(struct address_space *mapping,
 }
 EXPORT_SYMBOL(grab_cache_page_write_begin);
 
+/*
+ * Start writeback on the pages in pgs[], and then try and remove those pages
+ * from the page cached. Used with RWF_UNCACHED.
+ */
+void write_drop_cached_pages(struct page **pgs, struct address_space *mapping,
+			     unsigned *nr)
+{
+	loff_t start, end;
+	int i;
+
+	end = 0;
+	start = LLONG_MAX;
+	for (i = 0; i < *nr; i++) {
+		struct page *page = pgs[i];
+		loff_t off;
+
+		off = (loff_t) page_to_index(page) << PAGE_SHIFT;
+		if (off < start)
+			start = off;
+		if (off > end)
+			end = off;
+		get_page(page);
+	}
+
+	__filemap_fdatawrite_range(mapping, start, end, WB_SYNC_NONE);
+
+	for (i = 0; i < *nr; i++) {
+		struct page *page = pgs[i];
+
+		lock_page(page);
+		if (page->mapping == mapping) {
+			wait_on_page_writeback(page);
+			if (!page_has_private(page) ||
+			    try_to_release_page(page, 0))
+				remove_mapping(mapping, page);
+		}
+		unlock_page(page);
+	}
+	*nr = 0;
+}
+EXPORT_SYMBOL_GPL(write_drop_cached_pages);
+
+#define GPW_PAGE_BATCH		16
+
 ssize_t generic_perform_write(struct file *file,
 				struct iov_iter *i, struct kiocb *iocb)
 {
 	struct address_space *mapping = file->f_mapping;
 	const struct address_space_operations *a_ops = mapping->a_ops;
+	struct page *drop_pages[GPW_PAGE_BATCH];
 	loff_t pos = iocb->ki_pos;
 	long status = 0;
 	ssize_t written = 0;
-	unsigned int flags = 0;
+	unsigned int flags = 0, drop_nr = 0;
 
 	do {
 		struct page *page;
 		unsigned long offset;	/* Offset into pagecache page */
 		unsigned long bytes;	/* Bytes to write to page */
 		size_t copied;		/* Bytes copied from user */
+		bool drop_page = false;	/* drop page after IO */
 		void *fsdata;
 
 		offset = (pos & (PAGE_SIZE - 1));
@@ -3323,6 +3371,9 @@  ssize_t generic_perform_write(struct file *file,
 						iov_iter_count(i));
 
 again:
+		if (iocb->ki_flags & IOCB_UNCACHED)
+			flags |= AOP_FLAG_UNCACHED;
+
 		/*
 		 * Bring in the user page that we will copy from _first_.
 		 * Otherwise there's a nasty deadlock on copying from the
@@ -3343,10 +3394,17 @@  ssize_t generic_perform_write(struct file *file,
 			break;
 		}
 
+retry:
 		status = a_ops->write_begin(file, mapping, pos, bytes, flags,
 						&page, &fsdata);
-		if (unlikely(status < 0))
+		if (unlikely(status < 0)) {
+			if (status == -ENOMEM && (flags & AOP_FLAG_UNCACHED)) {
+				drop_page = true;
+				flags &= ~AOP_FLAG_UNCACHED;
+				goto retry;
+			}
 			break;
+		}
 
 		if (mapping_writably_mapped(mapping))
 			flush_dcache_page(page);
@@ -3376,12 +3434,22 @@  ssize_t generic_perform_write(struct file *file,
 						iov_iter_single_seg_count(i));
 			goto again;
 		}
+		if (drop_page &&
+		    ((pos >> PAGE_SHIFT) != ((pos + copied) >> PAGE_SHIFT))) {
+			drop_pages[drop_nr] = page;
+			if (++drop_nr == GPW_PAGE_BATCH)
+				write_drop_cached_pages(drop_pages, mapping,
+								&drop_nr);
+		} else
+			balance_dirty_pages_ratelimited(mapping);
+
 		pos += copied;
 		written += copied;
-
-		balance_dirty_pages_ratelimited(mapping);
 	} while (iov_iter_count(i));
 
+	if (drop_nr)
+		write_drop_cached_pages(drop_pages, mapping, &drop_nr);
+
 	return written ? written : status;
 }
 EXPORT_SYMBOL(generic_perform_write);