Message ID | cover.1572255424.git.mbobrowski@mbobrowski.org (mailing list archive) |
---|---|
Headers | show |
Series | ext4: port direct I/O to iomap infrastructure | expand |
Hi Matthew, it looks like there are a number of problems with this patch series when using the ext3 backwards compatibility mode (e.g., no extents enabled). So the following configurations are failing: kvm-xfstests -c ext3 generic/091 generic/240 generic/263 It looks like the main issue is related to fsx failing. On v5.4-rc3 the following command: fsx -N 10000 -l 500000 -r 4096 -t 512 -w 512 -Z -R -W /vdd/junk ... has the following result: root@kvm-xfstests:~# mount /vdd [ 9.366568] EXT4-fs (vdd): mounting ext3 file system using the ext4 subsystem [ 9.385537] EXT4-fs (vdd): recovery complete [ 9.389219] EXT4-fs (vdd): mounted filesystem with ordered data mode. Opts: (null) root@kvm-xfstests:~# fsx -N 10000 -l 500000 -r 4096 -t 512 -w 512 -Z -R -W /vdd/junk mapped writes DISABLED Seed set to 1 main: filesystem does not support fallocate mode 0, disabling! main: filesystem does not support fallocate mode FALLOC_FL_KEEP_SIZE, disabling! main: filesystem does not support fallocate mode FALLOC_FL_ZERO_RANGE, disabling! main: filesystem does not support fallocate mode FALLOC_FL_COLLAPSE_RANGE, disabling! main: filesystem does not support fallocate mode FALLOC_FL_INSERT_RANGE, disabling! main: filesystem does not support clone range, disabling! main: filesystem does not support dedupe range, disabling! truncating to largest ever: 0xe400 copying to largest ever: 0x6de00 copying to largest ever: 0x76a00 copying to largest ever: 0x78200 copying to largest ever: 0x78400 copying to largest ever: 0x78c00 truncating to largest ever: 0x79200 truncating to largest ever: 0x79600 copying to largest ever: 0x79800 All 10000 operations completed A-OK! root@kvm-xfstests:~# However, with this patch series applied, the fsx command fails with a "short write": root@kvm-xfstests:~# mount /vdd [ 7.854352] EXT4-fs (vdd): mounting ext3 file system using the ext4 subsystem [ 7.892418] EXT4-fs (vdd): recovery complete [ 7.896480] EXT4-fs (vdd): mounted filesystem with ordered data mode. Opts: (null) root@kvm-xfstests:~# fsx -N 10000 -l 500000 -r 4096 -t 512 -w 512 -Z -R -W /vdd/junk mapped writes DISABLED Seed set to 1 main: filesystem does not support fallocate mode 0, disabling! main: filesystem does not support fallocate mode FALLOC_FL_KEEP_SIZE, disabling! main: filesystem does not support fallocate mode FALLOC_FL_ZERO_RANGE, disabling! main: filesystem does not support fallocate mode FALLOC_FL_COLLAPSE_RANGE, disabling! main: filesystem does not support fallocate mode FALLOC_FL_INSERT_RANGE, disabling! main: filesystem does not support clone range, disabling! main: filesystem does not support dedupe range, disabling! truncating to largest ever: 0xe400 copying to largest ever: 0x6de00 copying to largest ever: 0x76a00 short write: 0xc00 bytes instead of 0x9c00 LOG DUMP (60 total operations): 1( 1 mod 256): SKIPPED (no operation) 2( 2 mod 256): SKIPPED (no operation) 3( 3 mod 256): SKIPPED (no operation) 4( 4 mod 256): TRUNCATE UP from 0x0 to 0xe400 5( 5 mod 256): SKIPPED (no operation) 6( 6 mod 256): SKIPPED (no operation) 7( 7 mod 256): SKIPPED (no operation) 8( 8 mod 256): WRITE 0x65a00 thru 0x665ff (0xc00 bytes) HOLE 9( 9 mod 256): WRITE 0x17400 thru 0x1e7ff (0x7400 bytes) 10( 10 mod 256): SKIPPED (no operation) 11( 11 mod 256): WRITE 0xa200 thru 0x12bff (0x8a00 bytes) 12( 12 mod 256): SKIPPED (no operation) 13( 13 mod 256): SKIPPED (no operation) 14( 14 mod 256): SKIPPED (no operation) 15( 15 mod 256): SKIPPED (no operation) 16( 16 mod 256): SKIPPED (no operation) 17( 17 mod 256): SKIPPED (no operation) 18( 18 mod 256): COPY 0x8000 thru 0x13fff (0xc000 bytes) to 0x61e00 thru 0x6ddff 19( 19 mod 256): SKIPPED (no operation) 20( 20 mod 256): COPY 0x3a000 thru 0x46fff (0xd000 bytes) to 0x69a00 thru 0x769ff 21( 21 mod 256): READ 0x34000 thru 0x3efff (0xb000 bytes) 22( 22 mod 256): WRITE 0x1f200 thru 0x2cbff (0xda00 bytes) 23( 23 mod 256): READ 0x55000 thru 0x5bfff (0x7000 bytes) 24( 24 mod 256): WRITE 0x23000 thru 0x285ff (0x5600 bytes) 25( 25 mod 256): WRITE 0x47800 thru 0x4b1ff (0x3a00 bytes) 26( 26 mod 256): SKIPPED (no operation) 27( 27 mod 256): READ 0x16000 thru 0x1afff (0x5000 bytes) 28( 28 mod 256): SKIPPED (no operation) 29( 29 mod 256): SKIPPED (no operation) 30( 30 mod 256): SKIPPED (no operation) 31( 31 mod 256): SKIPPED (no operation) 32( 32 mod 256): SKIPPED (no operation) 33( 33 mod 256): SKIPPED (no operation) 34( 34 mod 256): SKIPPED (no operation) 35( 35 mod 256): READ 0x23000 thru 0x2afff (0x8000 bytes) 36( 36 mod 256): SKIPPED (no operation) 37( 37 mod 256): PUNCH 0x11100 thru 0x18049 (0x6f4a bytes) 38( 38 mod 256): READ 0x3000 thru 0x5fff (0x3000 bytes) 39( 39 mod 256): SKIPPED (no operation) 40( 40 mod 256): COPY 0x5b000 thru 0x65fff (0xb000 bytes) to 0x66600 thru 0x715ff 41( 41 mod 256): SKIPPED (no operation) 42( 42 mod 256): WRITE 0x36c00 thru 0x3fdff (0x9200 bytes) 43( 43 mod 256): SKIPPED (no operation) 44( 44 mod 256): PUNCH 0x3199d thru 0x3feaf (0xe513 bytes) 45( 45 mod 256): SKIPPED (no operation) 46( 46 mod 256): SKIPPED (no operation) 47( 47 mod 256): COPY 0x71000 thru 0x75fff (0x5000 bytes) to 0x38800 thru 0x3d7ff 48( 48 mod 256): SKIPPED (no operation) 49( 49 mod 256): SKIPPED (no operation) 50( 50 mod 256): SKIPPED (no operation) 51( 51 mod 256): READ 0x3a000 thru 0x43fff (0xa000 bytes) 52( 52 mod 256): SKIPPED (no operation) 53( 53 mod 256): READ 0x10000 thru 0x16fff (0x7000 bytes) 54( 54 mod 256): SKIPPED (no operation) 55( 55 mod 256): SKIPPED (no operation) 56( 56 mod 256): PUNCH 0x8a16 thru 0x1845d (0xfa48 bytes) 57( 57 mod 256): WRITE 0x12800 thru 0x207ff (0xe000 bytes) 58( 58 mod 256): SKIPPED (no operation) 59( 59 mod 256): COPY 0x28000 thru 0x36fff (0xf000 bytes) to 0x9600 thru 0x185ff 60( 60 mod 256): WRITE 0x24000 thru 0x2dbff (0x9c00 bytes) Log of operations saved to "/vdd/junk.fsxops"; replay with --replay-ops Correct content saved for comparison (maybe hexdump "/vdd/junk" vs "/vdd/junk.fsxgood") Could you take a look? Thanks!! - Ted
On Tue, Oct 29, 2019 at 07:31:59PM -0400, Theodore Y. Ts'o wrote: > Hi Matthew, it looks like there are a number of problems with this > patch series when using the ext3 backwards compatibility mode (e.g., > no extents enabled). > > So the following configurations are failing: > > kvm-xfstests -c ext3 generic/091 generic/240 generic/263 Here are the details of the generic/240 failure: root@kvm-xfstests:~# diff -u /root/xfstests/tests/generic/240.out /results/ext4/results-ext3/generic/240.out.bad --- /root/xfstests/tests/generic/240.out 2019-10-21 15:29:56.000000000 -0400 +++ /results/ext4/results-ext3/generic/240.out.bad 2019-10-29 19:32:29.166850310 -0400 @@ -1,2 +1,7 @@ QA output created by 240 Silence is golden. +AIO write offset 4608 expected 4096 got 512 +AIO write offset 8704 expected 4096 got 512 +non one buffer at buf[0] => 0x00,00,00,00 +non-one read at offset 12800 +*** WARNING *** /vdd/aiodio_sparse has not been unlinked; if you don't rm it manually first, it may influence the next run - Ted
On Tue, Oct 29, 2019 at 07:34:01PM -0400, Theodore Y. Ts'o wrote: > On Tue, Oct 29, 2019 at 07:31:59PM -0400, Theodore Y. Ts'o wrote: > > Hi Matthew, it looks like there are a number of problems with this > > patch series when using the ext3 backwards compatibility mode (e.g., > > no extents enabled). > > > > So the following configurations are failing: > > > > kvm-xfstests -c ext3 generic/091 generic/240 generic/263 This is one mode that I didn't get around to testing. Let me take a look at the above and get back to you. --<M>--
On Wed 30-10-19 13:00:24, Matthew Bobrowski wrote: > On Tue, Oct 29, 2019 at 07:34:01PM -0400, Theodore Y. Ts'o wrote: > > On Tue, Oct 29, 2019 at 07:31:59PM -0400, Theodore Y. Ts'o wrote: > > > Hi Matthew, it looks like there are a number of problems with this > > > patch series when using the ext3 backwards compatibility mode (e.g., > > > no extents enabled). > > > > > > So the following configurations are failing: > > > > > > kvm-xfstests -c ext3 generic/091 generic/240 generic/263 > > This is one mode that I didn't get around to testing. Let me take a > look at the above and get back to you. If I should guess, I'd start looking at what that -ENOTBLK fallback from direct IO ends up doing as we seem to be hitting that path... Honza
On Wed 30-10-19 12:26:52, Jan Kara wrote: > On Wed 30-10-19 13:00:24, Matthew Bobrowski wrote: > > On Tue, Oct 29, 2019 at 07:34:01PM -0400, Theodore Y. Ts'o wrote: > > > On Tue, Oct 29, 2019 at 07:31:59PM -0400, Theodore Y. Ts'o wrote: > > > > Hi Matthew, it looks like there are a number of problems with this > > > > patch series when using the ext3 backwards compatibility mode (e.g., > > > > no extents enabled). > > > > > > > > So the following configurations are failing: > > > > > > > > kvm-xfstests -c ext3 generic/091 generic/240 generic/263 > > > > This is one mode that I didn't get around to testing. Let me take a > > look at the above and get back to you. > > If I should guess, I'd start looking at what that -ENOTBLK fallback from > direct IO ends up doing as we seem to be hitting that path... Hum, actually no. This write from fsx output: 24( 24 mod 256): WRITE 0x23000 thru 0x285ff (0x5600 bytes) should have allocated blocks to where the failed write was going (0x24000). But still I'd expect some interaction between how buffered writes to holes interact with following direct IO writes... One of the subtle differences we have introduced with iomap conversion is that the old code in __generic_file_write_iter() did fsync & invalidate written range after buffered write fallback and we don't seem to do that now (probably should be fixed regardless of relation to this bug). Honza
On Wed, Oct 30, 2019 at 12:39:18PM +0100, Jan Kara wrote: > On Wed 30-10-19 12:26:52, Jan Kara wrote: > > On Wed 30-10-19 13:00:24, Matthew Bobrowski wrote: > > > On Tue, Oct 29, 2019 at 07:34:01PM -0400, Theodore Y. Ts'o wrote: > > > > On Tue, Oct 29, 2019 at 07:31:59PM -0400, Theodore Y. Ts'o wrote: > > > > > Hi Matthew, it looks like there are a number of problems with this > > > > > patch series when using the ext3 backwards compatibility mode (e.g., > > > > > no extents enabled). > > > > > > > > > > So the following configurations are failing: > > > > > > > > > > kvm-xfstests -c ext3 generic/091 generic/240 generic/263 > > > > > > This is one mode that I didn't get around to testing. Let me take a > > > look at the above and get back to you. > > > > If I should guess, I'd start looking at what that -ENOTBLK fallback from > > direct IO ends up doing as we seem to be hitting that path... > > Hum, actually no. This write from fsx output: > > 24( 24 mod 256): WRITE 0x23000 thru 0x285ff (0x5600 bytes) > > should have allocated blocks to where the failed write was going (0x24000). > But still I'd expect some interaction between how buffered writes to holes > interact with following direct IO writes... One of the subtle differences > we have introduced with iomap conversion is that the old code in > __generic_file_write_iter() did fsync & invalidate written range after > buffered write fallback and we don't seem to do that now (probably should > be fixed regardless of relation to this bug). After performing some debugging this afternoon, I quickly realised that the fix for this is rather trivial. Within the previous direct I/O implementation, we passed EXT4_GET_BLOCKS_CREATE to ext4_map_blocks() for any writes to inodes without extents. I seem to have missed that here and consequently block allocation for a write wasn't performing correctly in such cases. Also, I agree, the fsync + page cache invalidation bits need to be implemented. I'm just thinking to branch out within ext4_buffered_write_iter() and implement those bits there i.e. ... ret = generic_perform_write(); if (ret > 0 && iocb->ki_flags & IOCB_DIRECT) { err = filemap_write_and_wait_range(); if (!err) invalidate_mapping_pages(); ... AFAICT, this would be the most appropriate place to put it? Or, did you have something else in mind? --<M>--
On Thu 31-10-19 20:16:41, Matthew Bobrowski wrote: > On Wed, Oct 30, 2019 at 12:39:18PM +0100, Jan Kara wrote: > > On Wed 30-10-19 12:26:52, Jan Kara wrote: > > > On Wed 30-10-19 13:00:24, Matthew Bobrowski wrote: > > > > On Tue, Oct 29, 2019 at 07:34:01PM -0400, Theodore Y. Ts'o wrote: > > > > > On Tue, Oct 29, 2019 at 07:31:59PM -0400, Theodore Y. Ts'o wrote: > > > > > > Hi Matthew, it looks like there are a number of problems with this > > > > > > patch series when using the ext3 backwards compatibility mode (e.g., > > > > > > no extents enabled). > > > > > > > > > > > > So the following configurations are failing: > > > > > > > > > > > > kvm-xfstests -c ext3 generic/091 generic/240 generic/263 > > > > > > > > This is one mode that I didn't get around to testing. Let me take a > > > > look at the above and get back to you. > > > > > > If I should guess, I'd start looking at what that -ENOTBLK fallback from > > > direct IO ends up doing as we seem to be hitting that path... > > > > Hum, actually no. This write from fsx output: > > > > 24( 24 mod 256): WRITE 0x23000 thru 0x285ff (0x5600 bytes) > > > > should have allocated blocks to where the failed write was going (0x24000). > > But still I'd expect some interaction between how buffered writes to holes > > interact with following direct IO writes... One of the subtle differences > > we have introduced with iomap conversion is that the old code in > > __generic_file_write_iter() did fsync & invalidate written range after > > buffered write fallback and we don't seem to do that now (probably should > > be fixed regardless of relation to this bug). > > After performing some debugging this afternoon, I quickly realised > that the fix for this is rather trivial. Within the previous direct > I/O implementation, we passed EXT4_GET_BLOCKS_CREATE to > ext4_map_blocks() for any writes to inodes without extents. I seem to > have missed that here and consequently block allocation for a write > wasn't performing correctly in such cases. No, this is not correct. For inodes without extents we used ext4_dio_get_block() and we pass DIO_SKIP_HOLES to __blockdev_direct_IO(). Now DIO_SKIP_HOLES means that if starting block is within i_size, we pass 'create == 0' to get_blocks() function and thus ext4_dio_get_block() uses '0' argument to ext4_map_blocks() similarly to what you do. And indeed for inodes without extents we must fallback to buffered IO for filling holes inside a file to avoid stale data exposure (racing DIO read could read block contents before data is written to it if we used EXT4_GET_BLOCKS_CREATE). > Also, I agree, the fsync + page cache invalidation bits need to be > implemented. I'm just thinking to branch out within > ext4_buffered_write_iter() and implement those bits there i.e. > > ... > ret = generic_perform_write(); > > if (ret > 0 && iocb->ki_flags & IOCB_DIRECT) { > err = filemap_write_and_wait_range(); > > if (!err) > invalidate_mapping_pages(); > ... > > AFAICT, this would be the most appropriate place to put it? Or, did > you have something else in mind? Yes, either this, or maybe in ext4_dio_write_iter() after returning from ext4_buffered_write_iter() would be even more logical. Honza
On Thu, Oct 31, 2019 at 05:54:16PM +0100, Jan Kara wrote: > On Thu 31-10-19 20:16:41, Matthew Bobrowski wrote: > > On Wed, Oct 30, 2019 at 12:39:18PM +0100, Jan Kara wrote: > > > On Wed 30-10-19 12:26:52, Jan Kara wrote: > > > Hum, actually no. This write from fsx output: > > > > > > 24( 24 mod 256): WRITE 0x23000 thru 0x285ff (0x5600 bytes) > > > > > > should have allocated blocks to where the failed write was going (0x24000). > > > But still I'd expect some interaction between how buffered writes to holes > > > interact with following direct IO writes... One of the subtle differences > > > we have introduced with iomap conversion is that the old code in > > > __generic_file_write_iter() did fsync & invalidate written range after > > > buffered write fallback and we don't seem to do that now (probably should > > > be fixed regardless of relation to this bug). > > > > After performing some debugging this afternoon, I quickly realised > > that the fix for this is rather trivial. Within the previous direct > > I/O implementation, we passed EXT4_GET_BLOCKS_CREATE to > > ext4_map_blocks() for any writes to inodes without extents. I seem to > > have missed that here and consequently block allocation for a write > > wasn't performing correctly in such cases. > > No, this is not correct. For inodes without extents we used > ext4_dio_get_block() and we pass DIO_SKIP_HOLES to __blockdev_direct_IO(). > Now DIO_SKIP_HOLES means that if starting block is within i_size, we pass > 'create == 0' to get_blocks() function and thus ext4_dio_get_block() uses > '0' argument to ext4_map_blocks() similarly to what you do. Ah right, I missed that part. :( > And indeed for inodes without extents we must fallback to buffered IO for > filling holes inside a file to avoid stale data exposure (racing DIO read > could read block contents before data is written to it if we used > EXT4_GET_BLOCKS_CREATE). Well in this case I'm pretty sure I know exactly where the problem resides. I seem to be falling back to buffered I/O from ext4_dio_write_iter() without actually taking into account any of the data that may have partially been written by the direct I/O. So, when returning the bytes written back to userspace it's whatever actually is returned by ext4_buffered_write_iter(), which may not necessarily be the amount of bytes that were expected, so it should rather be ext4_dio_write_iter() + ext4_buffered_write_iter()... > > Also, I agree, the fsync + page cache invalidation bits need to be > > implemented. I'm just thinking to branch out within > > ext4_buffered_write_iter() and implement those bits there i.e. > > > > ... > > ret = generic_perform_write(); > > > > if (ret > 0 && iocb->ki_flags & IOCB_DIRECT) { > > err = filemap_write_and_wait_range(); > > > > if (!err) > > invalidate_mapping_pages(); > > ... > > > > AFAICT, this would be the most appropriate place to put it? Or, did > > you have something else in mind? > > Yes, either this, or maybe in ext4_dio_write_iter() after returning from > ext4_buffered_write_iter() would be even more logical. Yes, let's stick with doing it within ext4_dio_write_iter(). --<M>--
Hi Matthew, could you do me a favor? For the next (and hopefully final :-) spin of this patch series, could you base it on the ext4.git's master branch. Then pull in Darrick's iomap-for-next branch, and then apply your patches on top of that. I attempted to do this with the v6 patch series --- see the tt/mb-dio branch --- and I described on another e-mail thread, I appear to have screwed up that patch conflicts, since it's causing a failure with diroead-nolock using a 1k block size. Since this wasn't something that worked when you were first working on the patch set, this isn't something I'm going to consider blocking, especially since a flay test failure which happens 7% of the time, and using dioread_nolock with a sub-page blocksize isn't something that is going to be all that common (since it wasn't working at all up until now). Still, I'm hoping that either Ritesh or you can figure out how my simple-minded handling of the patch conflict between your and his patch series can be addressed properly. Thanks!! - Ted
Howdy Ted! On Sun, Nov 03, 2019 at 02:20:40PM -0500, Theodore Y. Ts'o wrote: > Hi Matthew, could you do me a favor? For the next (and hopefully > final :-) spin of this patch series, could you base it on the > ext4.git's master branch. Then pull in Darrick's iomap-for-next > branch, and then apply your patches on top of that. > > I attempted to do this with the v6 patch series --- see the tt/mb-dio > branch --- and I described on another e-mail thread, I appear to have > screwed up that patch conflicts, since it's causing a failure with > diroead-nolock using a 1k block size. Since this wasn't something > that worked when you were first working on the patch set, this isn't > something I'm going to consider blocking, especially since a flay test > failure which happens 7% of the time, and using dioread_nolock with a > sub-page blocksize isn't something that is going to be all that common > (since it wasn't working at all up until now). > > Still, I'm hoping that either Ritesh or you can figure out how my > simple-minded handling of the patch conflict between your and his > patch series can be addressed properly. OK, I will try get around to this tonight. :) --<M>--