[v6,00/11] ext4: port direct I/O to iomap infrastructure

Message ID	cover.1572255424.git.mbobrowski@mbobrowski.org (mailing list archive)
Headers	show Return-Path: <SRS0=YA0C=YV=vger.kernel.org=linux-fsdevel-owner@kernel.org> Date: Mon, 28 Oct 2019 21:50:03 +1100 From: Matthew Bobrowski <mbobrowski@mbobrowski.org> To: tytso@mit.edu, jack@suse.cz, adilger.kernel@dilger.ca Cc: linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, hch@infradead.org, david@fromorbit.com, darrick.wong@oracle.com Subject: [PATCH v6 00/11] ext4: port direct I/O to iomap infrastructure Message-ID: <cover.1572255424.git.mbobrowski@mbobrowski.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.10.1 (2018-07-13) Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk
Series	ext4: port direct I/O to iomap infrastructure \| expand [v6,00/11] ext4: port direct I/O to iomap infrastructure [v6,01/11] ext4: reorder map.m_flags checks within ext4_iomap_begin() [v6,02/11] ext4: update direct I/O read lock pattern for IOCB_NOWAIT [v6,03/11] ext4: iomap that extends beyond EOF should be marked dirty [v6,04/11] ext4: move set iomap routines into a separate helper ext4_set_iomap() [v6,05/11] ext4: split IOMAP_WRITE branch in ext4_iomap_begin() into helper [v6,06/11] ext4: introduce new callback for IOMAP_REPORT [v6,07/11] ext4: introduce direct I/O read using iomap infrastructure [v6,08/11] ext4: move inode extension/truncate code out from ->iomap_end() callback [v6,09/11] ext4: move inode extension check out from ext4_iomap_alloc() [v6,11/11] ext4: introduce direct I/O write using iomap infrastructure [v6,10/11] ext4: update ext4_sync_file() to not use __generic_file_fsync()

Matthew Bobrowski Oct. 28, 2019, 10:50 a.m. UTC

Hi!

OK, so I think we're now coming very close to having this patch series
ready, if not already there. I've made some slight amendments to this
patch series as a result of the last round of feedback. This patch
series is based off Darrick's iomap-for-next branch, as there are some
iomap API changes there that we're dependent on here.

Changes since v5:

 - Fixed up calling into __generic_file_fsync() from
   ext4_sync_file(). Previously this caused a recursive lock situation
   when the filesystem was created without a journal.

 - Rolled up the orphan handling code from ext4_dio_write_iter() into
   ext4_handle_inode_extension().

 - Dropped redundant conditional statement and expression from
   ext4_iomap_is_delalloc().

 - Fixed up a couple comments, changelogs, as well as some other
   really minor grammatical corrections.

This patch series ports the ext4 direct I/O paths over to the iomap
infrastructure. The legacy buffer_head based direct I/O implementation
has subsequently been removed as it's no longer in use. The result of
this change is that ext4 now uses the newer iomap framework for direct
I/O read/write operations. Overall, this results in a much cleaner
direct I/O implementation and keeps this code isolated from the
buffer_head internals. In addition to this, a slight performance boost
may be expected while using O_SYNC | O_DIRECT.

The changes within this patch series have been tested via xfstests in
both DAX and non-DAX modes using the various filesystem configuration
options, including: 4k, dioread_nolock, nojournal.

Matthew Bobrowski (11):
  ext4: reorder map.m_flags checks within ext4_iomap_begin()
  ext4: update direct I/O read lock pattern for IOCB_NOWAIT
  ext4: iomap that extends beyond EOF should be marked dirty
  ext4: move set iomap routines into a separate helper ext4_set_iomap()
  ext4: split IOMAP_WRITE branch in ext4_iomap_begin() into helper
  ext4: introduce new callback for IOMAP_REPORT
  ext4: introduce direct I/O read using iomap infrastructure
  ext4: move inode extension/truncate code out from ->iomap_end()
    callback
  ext4: move inode extension check out from ext4_iomap_alloc()
  ext4: update ext4_sync_file() to not use __generic_file_fsync()
  ext4: introduce direct I/O write using iomap infrastructure

 fs/ext4/ext4.h    |   4 +-
 fs/ext4/extents.c |  11 +-
 fs/ext4/file.c    | 387 ++++++++++++++++++++-----
 fs/ext4/fsync.c   |  72 +++--
 fs/ext4/inode.c   | 714 +++++++++++-----------------------------------
 5 files changed, 537 insertions(+), 651 deletions(-)

Theodore Ts'o Oct. 29, 2019, 11:31 p.m. UTC | #1

Hi Matthew, it looks like there are a number of problems with this
patch series when using the ext3 backwards compatibility mode (e.g.,
no extents enabled).

So the following configurations are failing:

kvm-xfstests -c ext3   generic/091 generic/240 generic/263

It looks like the main issue is related to fsx failing.  On v5.4-rc3
the following command:

fsx -N 10000 -l 500000 -r 4096 -t 512 -w 512 -Z -R -W /vdd/junk

... has the following result:

root@kvm-xfstests:~# mount /vdd
[    9.366568] EXT4-fs (vdd): mounting ext3 file system using the ext4 subsystem
[    9.385537] EXT4-fs (vdd): recovery complete
[    9.389219] EXT4-fs (vdd): mounted filesystem with ordered data mode. Opts: (null)
root@kvm-xfstests:~# fsx -N 10000 -l 500000 -r 4096 -t 512 -w 512 -Z -R -W /vdd/junk
mapped writes DISABLED
Seed set to 1
main: filesystem does not support fallocate mode 0, disabling!
main: filesystem does not support fallocate mode FALLOC_FL_KEEP_SIZE, disabling!
main: filesystem does not support fallocate mode FALLOC_FL_ZERO_RANGE, disabling!
main: filesystem does not support fallocate mode FALLOC_FL_COLLAPSE_RANGE, disabling!
main: filesystem does not support fallocate mode FALLOC_FL_INSERT_RANGE, disabling!
main: filesystem does not support clone range, disabling!
main: filesystem does not support dedupe range, disabling!
truncating to largest ever: 0xe400
copying to largest ever: 0x6de00
copying to largest ever: 0x76a00
copying to largest ever: 0x78200
copying to largest ever: 0x78400
copying to largest ever: 0x78c00
truncating to largest ever: 0x79200
truncating to largest ever: 0x79600
copying to largest ever: 0x79800
All 10000 operations completed A-OK!
root@kvm-xfstests:~# 

However, with this patch series applied, the fsx command fails with a
"short write":

root@kvm-xfstests:~# mount /vdd
[    7.854352] EXT4-fs (vdd): mounting ext3 file system using the ext4 subsystem
[    7.892418] EXT4-fs (vdd): recovery complete
[    7.896480] EXT4-fs (vdd): mounted filesystem with ordered data mode. Opts: (null)
root@kvm-xfstests:~# fsx -N 10000 -l 500000 -r 4096 -t 512 -w 512 -Z -R -W /vdd/junk
mapped writes DISABLED
Seed set to 1
main: filesystem does not support fallocate mode 0, disabling!
main: filesystem does not support fallocate mode FALLOC_FL_KEEP_SIZE, disabling!
main: filesystem does not support fallocate mode FALLOC_FL_ZERO_RANGE, disabling!
main: filesystem does not support fallocate mode FALLOC_FL_COLLAPSE_RANGE, disabling!
main: filesystem does not support fallocate mode FALLOC_FL_INSERT_RANGE, disabling!
main: filesystem does not support clone range, disabling!
main: filesystem does not support dedupe range, disabling!
truncating to largest ever: 0xe400
copying to largest ever: 0x6de00
copying to largest ever: 0x76a00
short write: 0xc00 bytes instead of 0x9c00
LOG DUMP (60 total operations):
1(  1 mod 256): SKIPPED (no operation)
2(  2 mod 256): SKIPPED (no operation)
3(  3 mod 256): SKIPPED (no operation)
4(  4 mod 256): TRUNCATE UP	from 0x0 to 0xe400
5(  5 mod 256): SKIPPED (no operation)
6(  6 mod 256): SKIPPED (no operation)
7(  7 mod 256): SKIPPED (no operation)
8(  8 mod 256): WRITE    0x65a00 thru 0x665ff	(0xc00 bytes) HOLE
9(  9 mod 256): WRITE    0x17400 thru 0x1e7ff	(0x7400 bytes)
10( 10 mod 256): SKIPPED (no operation)
11( 11 mod 256): WRITE    0xa200 thru 0x12bff	(0x8a00 bytes)
12( 12 mod 256): SKIPPED (no operation)
13( 13 mod 256): SKIPPED (no operation)
14( 14 mod 256): SKIPPED (no operation)
15( 15 mod 256): SKIPPED (no operation)
16( 16 mod 256): SKIPPED (no operation)
17( 17 mod 256): SKIPPED (no operation)
18( 18 mod 256): COPY 0x8000 thru 0x13fff	(0xc000 bytes) to 0x61e00 thru 0x6ddff
19( 19 mod 256): SKIPPED (no operation)
20( 20 mod 256): COPY 0x3a000 thru 0x46fff	(0xd000 bytes) to 0x69a00 thru 0x769ff
21( 21 mod 256): READ     0x34000 thru 0x3efff	(0xb000 bytes)
22( 22 mod 256): WRITE    0x1f200 thru 0x2cbff	(0xda00 bytes)
23( 23 mod 256): READ     0x55000 thru 0x5bfff	(0x7000 bytes)
24( 24 mod 256): WRITE    0x23000 thru 0x285ff	(0x5600 bytes)
25( 25 mod 256): WRITE    0x47800 thru 0x4b1ff	(0x3a00 bytes)
26( 26 mod 256): SKIPPED (no operation)
27( 27 mod 256): READ     0x16000 thru 0x1afff	(0x5000 bytes)
28( 28 mod 256): SKIPPED (no operation)
29( 29 mod 256): SKIPPED (no operation)
30( 30 mod 256): SKIPPED (no operation)
31( 31 mod 256): SKIPPED (no operation)
32( 32 mod 256): SKIPPED (no operation)
33( 33 mod 256): SKIPPED (no operation)
34( 34 mod 256): SKIPPED (no operation)
35( 35 mod 256): READ     0x23000 thru 0x2afff	(0x8000 bytes)
36( 36 mod 256): SKIPPED (no operation)
37( 37 mod 256): PUNCH    0x11100 thru 0x18049	(0x6f4a bytes)
38( 38 mod 256): READ     0x3000 thru 0x5fff	(0x3000 bytes)
39( 39 mod 256): SKIPPED (no operation)
40( 40 mod 256): COPY 0x5b000 thru 0x65fff	(0xb000 bytes) to 0x66600 thru 0x715ff
41( 41 mod 256): SKIPPED (no operation)
42( 42 mod 256): WRITE    0x36c00 thru 0x3fdff	(0x9200 bytes)
43( 43 mod 256): SKIPPED (no operation)
44( 44 mod 256): PUNCH    0x3199d thru 0x3feaf	(0xe513 bytes)
45( 45 mod 256): SKIPPED (no operation)
46( 46 mod 256): SKIPPED (no operation)
47( 47 mod 256): COPY 0x71000 thru 0x75fff	(0x5000 bytes) to 0x38800 thru 0x3d7ff
48( 48 mod 256): SKIPPED (no operation)
49( 49 mod 256): SKIPPED (no operation)
50( 50 mod 256): SKIPPED (no operation)
51( 51 mod 256): READ     0x3a000 thru 0x43fff	(0xa000 bytes)
52( 52 mod 256): SKIPPED (no operation)
53( 53 mod 256): READ     0x10000 thru 0x16fff	(0x7000 bytes)
54( 54 mod 256): SKIPPED (no operation)
55( 55 mod 256): SKIPPED (no operation)
56( 56 mod 256): PUNCH    0x8a16 thru 0x1845d	(0xfa48 bytes)
57( 57 mod 256): WRITE    0x12800 thru 0x207ff	(0xe000 bytes)
58( 58 mod 256): SKIPPED (no operation)
59( 59 mod 256): COPY 0x28000 thru 0x36fff	(0xf000 bytes) to 0x9600 thru 0x185ff
60( 60 mod 256): WRITE    0x24000 thru 0x2dbff	(0x9c00 bytes)
Log of operations saved to "/vdd/junk.fsxops"; replay with --replay-ops
Correct content saved for comparison
(maybe hexdump "/vdd/junk" vs "/vdd/junk.fsxgood")

Could you take a look?  Thanks!!

						- Ted

Theodore Ts'o Oct. 29, 2019, 11:34 p.m. UTC | #2

On Tue, Oct 29, 2019 at 07:31:59PM -0400, Theodore Y. Ts'o wrote:
> Hi Matthew, it looks like there are a number of problems with this
> patch series when using the ext3 backwards compatibility mode (e.g.,
> no extents enabled).
> 
> So the following configurations are failing:
> 
> kvm-xfstests -c ext3   generic/091 generic/240 generic/263

Here are the details of the generic/240 failure:

root@kvm-xfstests:~# diff -u /root/xfstests/tests/generic/240.out /results/ext4/results-ext3/generic/240.out.bad
--- /root/xfstests/tests/generic/240.out	2019-10-21 15:29:56.000000000 -0400
+++ /results/ext4/results-ext3/generic/240.out.bad	2019-10-29 19:32:29.166850310 -0400
@@ -1,2 +1,7 @@
 QA output created by 240
 Silence is golden.
+AIO write offset 4608 expected 4096 got 512
+AIO write offset 8704 expected 4096 got 512
+non one buffer at buf[0] => 0x00,00,00,00
+non-one read at offset 12800
+*** WARNING *** /vdd/aiodio_sparse has not been unlinked; if you don't rm it manually first, it may influence the next run

     	     	 		    	    - Ted

Matthew Bobrowski Oct. 30, 2019, 2 a.m. UTC | #3

On Tue, Oct 29, 2019 at 07:34:01PM -0400, Theodore Y. Ts'o wrote:
> On Tue, Oct 29, 2019 at 07:31:59PM -0400, Theodore Y. Ts'o wrote:
> > Hi Matthew, it looks like there are a number of problems with this
> > patch series when using the ext3 backwards compatibility mode (e.g.,
> > no extents enabled).
> > 
> > So the following configurations are failing:
> > 
> > kvm-xfstests -c ext3   generic/091 generic/240 generic/263

This is one mode that I didn't get around to testing. Let me take a
look at the above and get back to you.

--<M>--

Jan Kara Oct. 30, 2019, 11:26 a.m. UTC | #4

On Wed 30-10-19 13:00:24, Matthew Bobrowski wrote:
> On Tue, Oct 29, 2019 at 07:34:01PM -0400, Theodore Y. Ts'o wrote:
> > On Tue, Oct 29, 2019 at 07:31:59PM -0400, Theodore Y. Ts'o wrote:
> > > Hi Matthew, it looks like there are a number of problems with this
> > > patch series when using the ext3 backwards compatibility mode (e.g.,
> > > no extents enabled).
> > > 
> > > So the following configurations are failing:
> > > 
> > > kvm-xfstests -c ext3   generic/091 generic/240 generic/263
> 
> This is one mode that I didn't get around to testing. Let me take a
> look at the above and get back to you.

If I should guess, I'd start looking at what that -ENOTBLK fallback from
direct IO ends up doing as we seem to be hitting that path...

								Honza

Jan Kara Oct. 30, 2019, 11:39 a.m. UTC | #5

On Wed 30-10-19 12:26:52, Jan Kara wrote:
> On Wed 30-10-19 13:00:24, Matthew Bobrowski wrote:
> > On Tue, Oct 29, 2019 at 07:34:01PM -0400, Theodore Y. Ts'o wrote:
> > > On Tue, Oct 29, 2019 at 07:31:59PM -0400, Theodore Y. Ts'o wrote:
> > > > Hi Matthew, it looks like there are a number of problems with this
> > > > patch series when using the ext3 backwards compatibility mode (e.g.,
> > > > no extents enabled).
> > > > 
> > > > So the following configurations are failing:
> > > > 
> > > > kvm-xfstests -c ext3   generic/091 generic/240 generic/263
> > 
> > This is one mode that I didn't get around to testing. Let me take a
> > look at the above and get back to you.
> 
> If I should guess, I'd start looking at what that -ENOTBLK fallback from
> direct IO ends up doing as we seem to be hitting that path...

Hum, actually no. This write from fsx output:

24( 24 mod 256): WRITE    0x23000 thru 0x285ff  (0x5600 bytes)

should have allocated blocks to where the failed write was going (0x24000).
But still I'd expect some interaction between how buffered writes to holes
interact with following direct IO writes... One of the subtle differences
we have introduced with iomap conversion is that the old code in
__generic_file_write_iter() did fsync & invalidate written range after
buffered write fallback and we don't seem to do that now (probably should
be fixed regardless of relation to this bug).

								Honza

Matthew Bobrowski Oct. 31, 2019, 9:16 a.m. UTC | #6

On Wed, Oct 30, 2019 at 12:39:18PM +0100, Jan Kara wrote:
> On Wed 30-10-19 12:26:52, Jan Kara wrote:
> > On Wed 30-10-19 13:00:24, Matthew Bobrowski wrote:
> > > On Tue, Oct 29, 2019 at 07:34:01PM -0400, Theodore Y. Ts'o wrote:
> > > > On Tue, Oct 29, 2019 at 07:31:59PM -0400, Theodore Y. Ts'o wrote:
> > > > > Hi Matthew, it looks like there are a number of problems with this
> > > > > patch series when using the ext3 backwards compatibility mode (e.g.,
> > > > > no extents enabled).
> > > > > 
> > > > > So the following configurations are failing:
> > > > > 
> > > > > kvm-xfstests -c ext3   generic/091 generic/240 generic/263
> > > 
> > > This is one mode that I didn't get around to testing. Let me take a
> > > look at the above and get back to you.
> > 
> > If I should guess, I'd start looking at what that -ENOTBLK fallback from
> > direct IO ends up doing as we seem to be hitting that path...
> 
> Hum, actually no. This write from fsx output:
> 
> 24( 24 mod 256): WRITE    0x23000 thru 0x285ff  (0x5600 bytes)
> 
> should have allocated blocks to where the failed write was going (0x24000).
> But still I'd expect some interaction between how buffered writes to holes
> interact with following direct IO writes... One of the subtle differences
> we have introduced with iomap conversion is that the old code in
> __generic_file_write_iter() did fsync & invalidate written range after
> buffered write fallback and we don't seem to do that now (probably should
> be fixed regardless of relation to this bug).

After performing some debugging this afternoon, I quickly realised
that the fix for this is rather trivial. Within the previous direct
I/O implementation, we passed EXT4_GET_BLOCKS_CREATE to
ext4_map_blocks() for any writes to inodes without extents. I seem to
have missed that here and consequently block allocation for a write
wasn't performing correctly in such cases.

Also, I agree, the fsync + page cache invalidation bits need to be
implemented. I'm just thinking to branch out within
ext4_buffered_write_iter() and implement those bits there i.e.

	...
	ret = generic_perform_write();

	if (ret > 0 && iocb->ki_flags & IOCB_DIRECT) {
	   	err = filemap_write_and_wait_range();

		if (!err)
			invalidate_mapping_pages();
	...

AFAICT, this would be the most appropriate place to put it? Or, did
you have something else in mind?

--<M>--

Jan Kara Oct. 31, 2019, 4:54 p.m. UTC | #7

On Thu 31-10-19 20:16:41, Matthew Bobrowski wrote:
> On Wed, Oct 30, 2019 at 12:39:18PM +0100, Jan Kara wrote:
> > On Wed 30-10-19 12:26:52, Jan Kara wrote:
> > > On Wed 30-10-19 13:00:24, Matthew Bobrowski wrote:
> > > > On Tue, Oct 29, 2019 at 07:34:01PM -0400, Theodore Y. Ts'o wrote:
> > > > > On Tue, Oct 29, 2019 at 07:31:59PM -0400, Theodore Y. Ts'o wrote:
> > > > > > Hi Matthew, it looks like there are a number of problems with this
> > > > > > patch series when using the ext3 backwards compatibility mode (e.g.,
> > > > > > no extents enabled).
> > > > > > 
> > > > > > So the following configurations are failing:
> > > > > > 
> > > > > > kvm-xfstests -c ext3   generic/091 generic/240 generic/263
> > > > 
> > > > This is one mode that I didn't get around to testing. Let me take a
> > > > look at the above and get back to you.
> > > 
> > > If I should guess, I'd start looking at what that -ENOTBLK fallback from
> > > direct IO ends up doing as we seem to be hitting that path...
> > 
> > Hum, actually no. This write from fsx output:
> > 
> > 24( 24 mod 256): WRITE    0x23000 thru 0x285ff  (0x5600 bytes)
> > 
> > should have allocated blocks to where the failed write was going (0x24000).
> > But still I'd expect some interaction between how buffered writes to holes
> > interact with following direct IO writes... One of the subtle differences
> > we have introduced with iomap conversion is that the old code in
> > __generic_file_write_iter() did fsync & invalidate written range after
> > buffered write fallback and we don't seem to do that now (probably should
> > be fixed regardless of relation to this bug).
> 
> After performing some debugging this afternoon, I quickly realised
> that the fix for this is rather trivial. Within the previous direct
> I/O implementation, we passed EXT4_GET_BLOCKS_CREATE to
> ext4_map_blocks() for any writes to inodes without extents. I seem to
> have missed that here and consequently block allocation for a write
> wasn't performing correctly in such cases.

No, this is not correct. For inodes without extents we used
ext4_dio_get_block() and we pass DIO_SKIP_HOLES to __blockdev_direct_IO().
Now DIO_SKIP_HOLES means that if starting block is within i_size, we pass
'create == 0' to get_blocks() function and thus ext4_dio_get_block() uses
'0' argument to ext4_map_blocks() similarly to what you do.

And indeed for inodes without extents we must fallback to buffered IO for
filling holes inside a file to avoid stale data exposure (racing DIO read
could read block contents before data is written to it if we used
EXT4_GET_BLOCKS_CREATE).

> Also, I agree, the fsync + page cache invalidation bits need to be
> implemented. I'm just thinking to branch out within
> ext4_buffered_write_iter() and implement those bits there i.e.
> 
> 	...
> 	ret = generic_perform_write();
> 
> 	if (ret > 0 && iocb->ki_flags & IOCB_DIRECT) {
> 	   	err = filemap_write_and_wait_range();
> 
> 		if (!err)
> 			invalidate_mapping_pages();
> 	...
> 
> AFAICT, this would be the most appropriate place to put it? Or, did
> you have something else in mind?

Yes, either this, or maybe in ext4_dio_write_iter() after returning from
ext4_buffered_write_iter() would be even more logical.

								Honza

Matthew Bobrowski Oct. 31, 2019, 10:58 p.m. UTC | #8

On Thu, Oct 31, 2019 at 05:54:16PM +0100, Jan Kara wrote:
> On Thu 31-10-19 20:16:41, Matthew Bobrowski wrote:
> > On Wed, Oct 30, 2019 at 12:39:18PM +0100, Jan Kara wrote:
> > > On Wed 30-10-19 12:26:52, Jan Kara wrote:
> > > Hum, actually no. This write from fsx output:
> > > 
> > > 24( 24 mod 256): WRITE    0x23000 thru 0x285ff  (0x5600 bytes)
> > > 
> > > should have allocated blocks to where the failed write was going (0x24000).
> > > But still I'd expect some interaction between how buffered writes to holes
> > > interact with following direct IO writes... One of the subtle differences
> > > we have introduced with iomap conversion is that the old code in
> > > __generic_file_write_iter() did fsync & invalidate written range after
> > > buffered write fallback and we don't seem to do that now (probably should
> > > be fixed regardless of relation to this bug).
> > 
> > After performing some debugging this afternoon, I quickly realised
> > that the fix for this is rather trivial. Within the previous direct
> > I/O implementation, we passed EXT4_GET_BLOCKS_CREATE to
> > ext4_map_blocks() for any writes to inodes without extents. I seem to
> > have missed that here and consequently block allocation for a write
> > wasn't performing correctly in such cases.
> 
> No, this is not correct. For inodes without extents we used
> ext4_dio_get_block() and we pass DIO_SKIP_HOLES to __blockdev_direct_IO().
> Now DIO_SKIP_HOLES means that if starting block is within i_size, we pass
> 'create == 0' to get_blocks() function and thus ext4_dio_get_block() uses
> '0' argument to ext4_map_blocks() similarly to what you do.

Ah right, I missed that part. :(

> And indeed for inodes without extents we must fallback to buffered IO for
> filling holes inside a file to avoid stale data exposure (racing DIO read
> could read block contents before data is written to it if we used
> EXT4_GET_BLOCKS_CREATE).

Well in this case I'm pretty sure I know exactly where the problem
resides. I seem to be falling back to buffered I/O from
ext4_dio_write_iter() without actually taking into account any of the
data that may have partially been written by the direct I/O. So, when
returning the bytes written back to userspace it's whatever actually
is returned by ext4_buffered_write_iter(), which may not necessarily
be the amount of bytes that were expected, so it should rather be
ext4_dio_write_iter() + ext4_buffered_write_iter()...

> > Also, I agree, the fsync + page cache invalidation bits need to be
> > implemented. I'm just thinking to branch out within
> > ext4_buffered_write_iter() and implement those bits there i.e.
> > 
> > 	...
> > 	ret = generic_perform_write();
> > 
> > 	if (ret > 0 && iocb->ki_flags & IOCB_DIRECT) {
> > 	   	err = filemap_write_and_wait_range();
> > 
> > 		if (!err)
> > 			invalidate_mapping_pages();
> > 	...
> > 
> > AFAICT, this would be the most appropriate place to put it? Or, did
> > you have something else in mind?
> 
> Yes, either this, or maybe in ext4_dio_write_iter() after returning from
> ext4_buffered_write_iter() would be even more logical.

Yes, let's stick with doing it within ext4_dio_write_iter().

--<M>--

Theodore Ts'o Nov. 3, 2019, 7:20 p.m. UTC | #9

Hi Matthew, could you do me a favor?  For the next (and hopefully
final :-) spin of this patch series, could you base it on the
ext4.git's master branch.  Then pull in Darrick's iomap-for-next
branch, and then apply your patches on top of that.

I attempted to do this with the v6 patch series --- see the tt/mb-dio
branch --- and I described on another e-mail thread, I appear to have
screwed up that patch conflicts, since it's causing a failure with
diroead-nolock using a 1k block size.  Since this wasn't something
that worked when you were first working on the patch set, this isn't
something I'm going to consider blocking, especially since a flay test
failure which happens 7% of the time, and using dioread_nolock with a
sub-page blocksize isn't something that is going to be all that common
(since it wasn't working at all up until now).

Still, I'm hoping that either Ritesh or you can figure out how my
simple-minded handling of the patch conflict between your and his
patch series can be addressed properly.

Thanks!!

						- Ted

Matthew Bobrowski Nov. 4, 2019, 6:04 a.m. UTC | #10

Howdy Ted!

On Sun, Nov 03, 2019 at 02:20:40PM -0500, Theodore Y. Ts'o wrote:
> Hi Matthew, could you do me a favor?  For the next (and hopefully
> final :-) spin of this patch series, could you base it on the
> ext4.git's master branch.  Then pull in Darrick's iomap-for-next
> branch, and then apply your patches on top of that.
> 
> I attempted to do this with the v6 patch series --- see the tt/mb-dio
> branch --- and I described on another e-mail thread, I appear to have
> screwed up that patch conflicts, since it's causing a failure with
> diroead-nolock using a 1k block size.  Since this wasn't something
> that worked when you were first working on the patch set, this isn't
> something I'm going to consider blocking, especially since a flay test
> failure which happens 7% of the time, and using dioread_nolock with a
> sub-page blocksize isn't something that is going to be all that common
> (since it wasn't working at all up until now).
> 
> Still, I'm hoping that either Ritesh or you can figure out how my
> simple-minded handling of the patch conflict between your and his
> patch series can be addressed properly.

OK, I will try get around to this tonight. :)

--<M>--

[v6,00/11] ext4: port direct I/O to iomap infrastructure

Message

Comments