Message ID | 20211216210715.3801857-26-willy@infradead.org (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | iomap/xfs folio patches | expand |
[resend with shorter 522.out file to keep us under the 300k maximum] On Thu, Dec 16, 2021 at 09:07:15PM +0000, Matthew Wilcox (Oracle) wrote: > Now that iomap has been converted, XFS is large folio safe. > Indicate to the VFS that it can now create large folios for XFS. > > Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> > Reviewed-by: Christoph Hellwig <hch@lst.de> > Reviewed-by: Darrick J. Wong <djwong@kernel.org> > --- > fs/xfs/xfs_icache.c | 2 ++ > 1 file changed, 2 insertions(+) > > diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c > index da4af2142a2b..cdc39f576ca1 100644 > --- a/fs/xfs/xfs_icache.c > +++ b/fs/xfs/xfs_icache.c > @@ -87,6 +87,7 @@ xfs_inode_alloc( > /* VFS doesn't initialise i_mode or i_state! */ > VFS_I(ip)->i_mode = 0; > VFS_I(ip)->i_state = 0; > + mapping_set_large_folios(VFS_I(ip)->i_mapping); > > XFS_STATS_INC(mp, vn_active); > ASSERT(atomic_read(&ip->i_pincount) == 0); > @@ -320,6 +321,7 @@ xfs_reinit_inode( > inode->i_rdev = dev; > inode->i_uid = uid; > inode->i_gid = gid; > + mapping_set_large_folios(inode->i_mapping); Hmm. Ever since 5.19-rc1, I've noticed that fsx in generic/522 now reports file corruption after 20 minutes of runtime. The corruption is surprisingly reproducible (522.out.bad attached below) in that I ran it three times and always got the same bad offset (0x6e000) and always the same opcode (6213798(166 mod 256) MAPREAD). I turned off multipage folios and now 522 has run for over an hour without problems, so before I go do more debugging, does this ring a bell to anyone? [addendum: Apparently vger now has a 300K message size limit; if you want the full output, see https://djwong.org/docs/522.out.txt ] --D QA output created by 522 READ BAD DATA: offset = 0x69e3e, size = 0x1c922, fname = /mnt/junk OFFSET GOOD BAD RANGE 0x6e000 0x0000 0x9173 0x00000 operation# (mod 256) for the bad data may be 145 0x6e001 0x0000 0x7391 0x00001 operation# (mod 256) for the bad data may be 145 0x6e002 0x0000 0x9195 0x00002 operation# (mod 256) for the bad data may be 145 0x6e003 0x0000 0x9591 0x00003 operation# (mod 256) for the bad data may be 145 0x6e004 0x0000 0x91b5 0x00004 operation# (mod 256) for the bad data may be 145 0x6e005 0x0000 0xb591 0x00005 operation# (mod 256) for the bad data may be 145 0x6e006 0x0000 0x91e2 0x00006 operation# (mod 256) for the bad data may be 145 0x6e007 0x0000 0xe291 0x00007 operation# (mod 256) for the bad data may be 145 0x6e008 0x0000 0x919d 0x00008 operation# (mod 256) for the bad data may be 145 0x6e009 0x0000 0x9d91 0x00009 operation# (mod 256) for the bad data may be 145 0x6e00a 0x0000 0x91e8 0x0000a operation# (mod 256) for the bad data may be 145 0x6e00b 0x0000 0xe891 0x0000b operation# (mod 256) for the bad data may be 145 0x6e00c 0x0000 0x91c9 0x0000c operation# (mod 256) for the bad data may be 145 0x6e00d 0x0000 0xc991 0x0000d operation# (mod 256) for the bad data may be 145 0x6e00e 0x0000 0x9147 0x0000e operation# (mod 256) for the bad data may be 145 0x6e00f 0x0000 0x4791 0x0000f operation# (mod 256) for the bad data may be 145 LOG DUMP (6213798 total operations): <snip> 6213732(100 mod 256): COLLAPSE 0x3b000 thru 0x4efff (0x14000 bytes) 6213733(101 mod 256): READ 0x1953d thru 0x29311 (0xfdd5 bytes) 6213734(102 mod 256): INSERT 0x14000 thru 0x2ffff (0x1c000 bytes) 6213735(103 mod 256): COPY 0x1d381 thru 0x36d38 (0x199b8 bytes) to 0x64491 thru 0x7de48 ******EEEE 6213736(104 mod 256): ZERO 0x74247 thru 0x927bf (0x1e579 bytes) 6213737(105 mod 256): INSERT 0x8000 thru 0x16fff (0xf000 bytes) 6213738(106 mod 256): READ 0x87aba thru 0x8ce48 (0x538f bytes) 6213739(107 mod 256): TRUNCATE DOWN from 0x8ce49 to 0x46571 ******WWWW 6213740(108 mod 256): SKIPPED (no operation) 6213741(109 mod 256): ZERO 0x55674 thru 0x70d41 (0x1b6ce bytes) ******ZZZZ 6213742(110 mod 256): PUNCH 0xc8b5 thru 0xe80d (0x1f59 bytes) 6213743(111 mod 256): TRUNCATE DOWN from 0x70d42 to 0x11ade ******WWWW 6213744(112 mod 256): COLLAPSE 0x6000 thru 0xffff (0xa000 bytes) 6213745(113 mod 256): SKIPPED (no operation) 6213746(114 mod 256): MAPREAD 0x2625 thru 0x7add (0x54b9 bytes) 6213747(115 mod 256): CLONE 0x2000 thru 0x6fff (0x5000 bytes) to 0x10000 thru 0x14fff 6213748(116 mod 256): SKIPPED (no operation) 6213749(117 mod 256): TRUNCATE UP from 0x15000 to 0x8d131 ******WWWW 6213750(118 mod 256): WRITE 0x82547 thru 0x88334 (0x5dee bytes) 6213751(119 mod 256): DEDUPE 0x7d000 thru 0x83fff (0x7000 bytes) to 0x22000 thru 0x28fff 6213752(120 mod 256): READ 0x11e69 thru 0x2864c (0x167e4 bytes) 6213753(121 mod 256): INSERT 0x41000 thru 0x45fff (0x5000 bytes) 6213754(122 mod 256): COPY 0x2ca4c thru 0x2ed9f (0x2354 bytes) to 0x2fef1 thru 0x32244 6213755(123 mod 256): MAPWRITE 0x70677 thru 0x8b993 (0x1b31d bytes) 6213756(124 mod 256): FALLOC 0x7229f thru 0x91158 (0x1eeb9 bytes) INTERIOR 6213757(125 mod 256): COLLAPSE 0x13000 thru 0x2bfff (0x19000 bytes) 6213758(126 mod 256): COPY 0x9271 thru 0xba34 (0x27c4 bytes) to 0x3227c thru 0x34a3f 6213759(127 mod 256): CLONE 0x23000 thru 0x2cfff (0xa000 bytes) to 0x6c000 thru 0x75fff ******JJJJ 6213760(128 mod 256): READ 0x44cff thru 0x4c4a1 (0x77a3 bytes) 6213761(129 mod 256): DEDUPE 0x60000 thru 0x73fff (0x14000 bytes) to 0x39000 thru 0x4cfff BBBB****** 6213762(130 mod 256): COLLAPSE 0x39000 thru 0x3ffff (0x7000 bytes) 6213763(131 mod 256): WRITE 0x57565 thru 0x5e710 (0x71ac bytes) 6213764(132 mod 256): MAPREAD 0x39c49 thru 0x4accd (0x11085 bytes) 6213765(133 mod 256): ZERO 0x4faf5 thru 0x6a5cc (0x1aad8 bytes) 6213766(134 mod 256): MAPREAD 0x57f8 thru 0x8c98 (0x34a1 bytes) 6213767(135 mod 256): MAPREAD 0x5cbd8 thru 0x72130 (0x15559 bytes) ***RRRR*** 6213768(136 mod 256): SKIPPED (no operation) 6213769(137 mod 256): INSERT 0x24000 thru 0x32fff (0xf000 bytes) 6213770(138 mod 256): COPY 0x32b0c thru 0x4d035 (0x1a52a bytes) to 0x4f97f thru 0x69ea8 6213771(139 mod 256): DEDUPE 0x3f000 thru 0x52fff (0x14000 bytes) to 0x23000 thru 0x36fff 6213772(140 mod 256): READ 0x6d9bf thru 0x81130 (0x13772 bytes) ***RRRR*** 6213773(141 mod 256): TRUNCATE DOWN from 0x81131 to 0x569c0 ******WWWW 6213774(142 mod 256): MAPREAD 0x354d5 thru 0x44e7b (0xf9a7 bytes) 6213775(143 mod 256): MAPWRITE 0x547c4 thru 0x60a8e (0xc2cb bytes) 6213776(144 mod 256): SKIPPED (no operation) 6213777(145 mod 256): WRITE 0x28ada thru 0x4356c (0x1aa93 bytes) 6213778(146 mod 256): ZERO 0x74c28 thru 0x91fec (0x1d3c5 bytes) 6213779(147 mod 256): INSERT 0x12000 thru 0x1cfff (0xb000 bytes) 6213780(148 mod 256): ZERO 0x30834 thru 0x330f7 (0x28c4 bytes) 6213781(149 mod 256): PUNCH 0x36080 thru 0x42edc (0xce5d bytes) 6213782(150 mod 256): DEDUPE 0x14000 thru 0x19fff (0x6000 bytes) to 0x49000 thru 0x4efff 6213783(151 mod 256): DEDUPE 0x51000 thru 0x5efff (0xe000 bytes) to 0x2a000 thru 0x37fff 6213784(152 mod 256): WRITE 0x2448e thru 0x400f5 (0x1bc68 bytes) 6213785(153 mod 256): ZERO 0x87615 thru 0x927bf (0xb1ab bytes) 6213786(154 mod 256): READ 0x5afc thru 0xa32c (0x4831 bytes) 6213787(155 mod 256): SKIPPED (no operation) 6213788(156 mod 256): ZERO 0x7aab0 thru 0x7e2b3 (0x3804 bytes) 6213789(157 mod 256): INSERT 0x45000 thru 0x58fff (0x14000 bytes) 6213790(158 mod 256): FALLOC 0x1a80e thru 0x289a3 (0xe195 bytes) INTERIOR 6213791(159 mod 256): SKIPPED (no operation) 6213792(160 mod 256): SKIPPED (no operation) 6213793(161 mod 256): FALLOC 0x2aca thru 0x20562 (0x1da98 bytes) INTERIOR 6213794(162 mod 256): ZERO 0x72fb9 thru 0x75887 (0x28cf bytes) 6213795(163 mod 256): COPY 0xa62e thru 0x218d0 (0x172a3 bytes) to 0x28ab1 thru 0x3fd53 6213796(164 mod 256): SKIPPED (no operation) 6213797(165 mod 256): COPY 0xa666 thru 0xf6a1 (0x503c bytes) to 0x353f0 thru 0x3a42b 6213798(166 mod 256): MAPREAD 0x69e3e thru 0x8675f (0x1c922 bytes) ***RRRR*** Log of operations saved to "/mnt/junk.fsxops"; replay with --replay-ops Correct content saved for comparison (maybe hexdump "/mnt/junk" vs "/mnt/junk.fsxgood") Silence is golden
On Wed, Jun 22, 2022 at 05:42:11PM -0700, Darrick J. Wong wrote: > [resend with shorter 522.out file to keep us under the 300k maximum] > > On Thu, Dec 16, 2021 at 09:07:15PM +0000, Matthew Wilcox (Oracle) wrote: > > Now that iomap has been converted, XFS is large folio safe. > > Indicate to the VFS that it can now create large folios for XFS. > > > > Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> > > Reviewed-by: Christoph Hellwig <hch@lst.de> > > Reviewed-by: Darrick J. Wong <djwong@kernel.org> > > --- > > fs/xfs/xfs_icache.c | 2 ++ > > 1 file changed, 2 insertions(+) > > > > diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c > > index da4af2142a2b..cdc39f576ca1 100644 > > --- a/fs/xfs/xfs_icache.c > > +++ b/fs/xfs/xfs_icache.c > > @@ -87,6 +87,7 @@ xfs_inode_alloc( > > /* VFS doesn't initialise i_mode or i_state! */ > > VFS_I(ip)->i_mode = 0; > > VFS_I(ip)->i_state = 0; > > + mapping_set_large_folios(VFS_I(ip)->i_mapping); > > > > XFS_STATS_INC(mp, vn_active); > > ASSERT(atomic_read(&ip->i_pincount) == 0); > > @@ -320,6 +321,7 @@ xfs_reinit_inode( > > inode->i_rdev = dev; > > inode->i_uid = uid; > > inode->i_gid = gid; > > + mapping_set_large_folios(inode->i_mapping); > > Hmm. Ever since 5.19-rc1, I've noticed that fsx in generic/522 now > reports file corruption after 20 minutes of runtime. The corruption is > surprisingly reproducible (522.out.bad attached below) in that I ran it > three times and always got the same bad offset (0x6e000) and always the > same opcode (6213798(166 mod 256) MAPREAD). > > I turned off multipage folios and now 522 has run for over an hour > without problems, so before I go do more debugging, does this ring a > bell to anyone? I tried bisecting, but that didn't yield anything productive and 5.19-rc4 still fails after 25 minutes; however, it seems that g/522 will run without problems for at least 3-4 days after reverting this patch from -rc3. So I guess I have a blunt force fix if we can't figure this one out before 5.19 final, but I'd really rather not. Will keep trying this week. --D > [addendum: Apparently vger now has a 300K message size limit; if you > want the full output, see https://djwong.org/docs/522.out.txt ] > > --D > > QA output created by 522 > READ BAD DATA: offset = 0x69e3e, size = 0x1c922, fname = /mnt/junk > OFFSET GOOD BAD RANGE > 0x6e000 0x0000 0x9173 0x00000 > operation# (mod 256) for the bad data may be 145 > 0x6e001 0x0000 0x7391 0x00001 > operation# (mod 256) for the bad data may be 145 > 0x6e002 0x0000 0x9195 0x00002 > operation# (mod 256) for the bad data may be 145 > 0x6e003 0x0000 0x9591 0x00003 > operation# (mod 256) for the bad data may be 145 > 0x6e004 0x0000 0x91b5 0x00004 > operation# (mod 256) for the bad data may be 145 > 0x6e005 0x0000 0xb591 0x00005 > operation# (mod 256) for the bad data may be 145 > 0x6e006 0x0000 0x91e2 0x00006 > operation# (mod 256) for the bad data may be 145 > 0x6e007 0x0000 0xe291 0x00007 > operation# (mod 256) for the bad data may be 145 > 0x6e008 0x0000 0x919d 0x00008 > operation# (mod 256) for the bad data may be 145 > 0x6e009 0x0000 0x9d91 0x00009 > operation# (mod 256) for the bad data may be 145 > 0x6e00a 0x0000 0x91e8 0x0000a > operation# (mod 256) for the bad data may be 145 > 0x6e00b 0x0000 0xe891 0x0000b > operation# (mod 256) for the bad data may be 145 > 0x6e00c 0x0000 0x91c9 0x0000c > operation# (mod 256) for the bad data may be 145 > 0x6e00d 0x0000 0xc991 0x0000d > operation# (mod 256) for the bad data may be 145 > 0x6e00e 0x0000 0x9147 0x0000e > operation# (mod 256) for the bad data may be 145 > 0x6e00f 0x0000 0x4791 0x0000f > operation# (mod 256) for the bad data may be 145 > LOG DUMP (6213798 total operations): > > <snip> > > 6213732(100 mod 256): COLLAPSE 0x3b000 thru 0x4efff (0x14000 bytes) > 6213733(101 mod 256): READ 0x1953d thru 0x29311 (0xfdd5 bytes) > 6213734(102 mod 256): INSERT 0x14000 thru 0x2ffff (0x1c000 bytes) > 6213735(103 mod 256): COPY 0x1d381 thru 0x36d38 (0x199b8 bytes) to 0x64491 thru 0x7de48 ******EEEE > 6213736(104 mod 256): ZERO 0x74247 thru 0x927bf (0x1e579 bytes) > 6213737(105 mod 256): INSERT 0x8000 thru 0x16fff (0xf000 bytes) > 6213738(106 mod 256): READ 0x87aba thru 0x8ce48 (0x538f bytes) > 6213739(107 mod 256): TRUNCATE DOWN from 0x8ce49 to 0x46571 ******WWWW > 6213740(108 mod 256): SKIPPED (no operation) > 6213741(109 mod 256): ZERO 0x55674 thru 0x70d41 (0x1b6ce bytes) ******ZZZZ > 6213742(110 mod 256): PUNCH 0xc8b5 thru 0xe80d (0x1f59 bytes) > 6213743(111 mod 256): TRUNCATE DOWN from 0x70d42 to 0x11ade ******WWWW > 6213744(112 mod 256): COLLAPSE 0x6000 thru 0xffff (0xa000 bytes) > 6213745(113 mod 256): SKIPPED (no operation) > 6213746(114 mod 256): MAPREAD 0x2625 thru 0x7add (0x54b9 bytes) > 6213747(115 mod 256): CLONE 0x2000 thru 0x6fff (0x5000 bytes) to 0x10000 thru 0x14fff > 6213748(116 mod 256): SKIPPED (no operation) > 6213749(117 mod 256): TRUNCATE UP from 0x15000 to 0x8d131 ******WWWW > 6213750(118 mod 256): WRITE 0x82547 thru 0x88334 (0x5dee bytes) > 6213751(119 mod 256): DEDUPE 0x7d000 thru 0x83fff (0x7000 bytes) to 0x22000 thru 0x28fff > 6213752(120 mod 256): READ 0x11e69 thru 0x2864c (0x167e4 bytes) > 6213753(121 mod 256): INSERT 0x41000 thru 0x45fff (0x5000 bytes) > 6213754(122 mod 256): COPY 0x2ca4c thru 0x2ed9f (0x2354 bytes) to 0x2fef1 thru 0x32244 > 6213755(123 mod 256): MAPWRITE 0x70677 thru 0x8b993 (0x1b31d bytes) > 6213756(124 mod 256): FALLOC 0x7229f thru 0x91158 (0x1eeb9 bytes) INTERIOR > 6213757(125 mod 256): COLLAPSE 0x13000 thru 0x2bfff (0x19000 bytes) > 6213758(126 mod 256): COPY 0x9271 thru 0xba34 (0x27c4 bytes) to 0x3227c thru 0x34a3f > 6213759(127 mod 256): CLONE 0x23000 thru 0x2cfff (0xa000 bytes) to 0x6c000 thru 0x75fff ******JJJJ > 6213760(128 mod 256): READ 0x44cff thru 0x4c4a1 (0x77a3 bytes) > 6213761(129 mod 256): DEDUPE 0x60000 thru 0x73fff (0x14000 bytes) to 0x39000 thru 0x4cfff BBBB****** > 6213762(130 mod 256): COLLAPSE 0x39000 thru 0x3ffff (0x7000 bytes) > 6213763(131 mod 256): WRITE 0x57565 thru 0x5e710 (0x71ac bytes) > 6213764(132 mod 256): MAPREAD 0x39c49 thru 0x4accd (0x11085 bytes) > 6213765(133 mod 256): ZERO 0x4faf5 thru 0x6a5cc (0x1aad8 bytes) > 6213766(134 mod 256): MAPREAD 0x57f8 thru 0x8c98 (0x34a1 bytes) > 6213767(135 mod 256): MAPREAD 0x5cbd8 thru 0x72130 (0x15559 bytes) ***RRRR*** > 6213768(136 mod 256): SKIPPED (no operation) > 6213769(137 mod 256): INSERT 0x24000 thru 0x32fff (0xf000 bytes) > 6213770(138 mod 256): COPY 0x32b0c thru 0x4d035 (0x1a52a bytes) to 0x4f97f thru 0x69ea8 > 6213771(139 mod 256): DEDUPE 0x3f000 thru 0x52fff (0x14000 bytes) to 0x23000 thru 0x36fff > 6213772(140 mod 256): READ 0x6d9bf thru 0x81130 (0x13772 bytes) ***RRRR*** > 6213773(141 mod 256): TRUNCATE DOWN from 0x81131 to 0x569c0 ******WWWW > 6213774(142 mod 256): MAPREAD 0x354d5 thru 0x44e7b (0xf9a7 bytes) > 6213775(143 mod 256): MAPWRITE 0x547c4 thru 0x60a8e (0xc2cb bytes) > 6213776(144 mod 256): SKIPPED (no operation) > 6213777(145 mod 256): WRITE 0x28ada thru 0x4356c (0x1aa93 bytes) > 6213778(146 mod 256): ZERO 0x74c28 thru 0x91fec (0x1d3c5 bytes) > 6213779(147 mod 256): INSERT 0x12000 thru 0x1cfff (0xb000 bytes) > 6213780(148 mod 256): ZERO 0x30834 thru 0x330f7 (0x28c4 bytes) > 6213781(149 mod 256): PUNCH 0x36080 thru 0x42edc (0xce5d bytes) > 6213782(150 mod 256): DEDUPE 0x14000 thru 0x19fff (0x6000 bytes) to 0x49000 thru 0x4efff > 6213783(151 mod 256): DEDUPE 0x51000 thru 0x5efff (0xe000 bytes) to 0x2a000 thru 0x37fff > 6213784(152 mod 256): WRITE 0x2448e thru 0x400f5 (0x1bc68 bytes) > 6213785(153 mod 256): ZERO 0x87615 thru 0x927bf (0xb1ab bytes) > 6213786(154 mod 256): READ 0x5afc thru 0xa32c (0x4831 bytes) > 6213787(155 mod 256): SKIPPED (no operation) > 6213788(156 mod 256): ZERO 0x7aab0 thru 0x7e2b3 (0x3804 bytes) > 6213789(157 mod 256): INSERT 0x45000 thru 0x58fff (0x14000 bytes) > 6213790(158 mod 256): FALLOC 0x1a80e thru 0x289a3 (0xe195 bytes) INTERIOR > 6213791(159 mod 256): SKIPPED (no operation) > 6213792(160 mod 256): SKIPPED (no operation) > 6213793(161 mod 256): FALLOC 0x2aca thru 0x20562 (0x1da98 bytes) INTERIOR > 6213794(162 mod 256): ZERO 0x72fb9 thru 0x75887 (0x28cf bytes) > 6213795(163 mod 256): COPY 0xa62e thru 0x218d0 (0x172a3 bytes) to 0x28ab1 thru 0x3fd53 > 6213796(164 mod 256): SKIPPED (no operation) > 6213797(165 mod 256): COPY 0xa666 thru 0xf6a1 (0x503c bytes) to 0x353f0 thru 0x3a42b > 6213798(166 mod 256): MAPREAD 0x69e3e thru 0x8675f (0x1c922 bytes) ***RRRR*** > Log of operations saved to "/mnt/junk.fsxops"; replay with --replay-ops > Correct content saved for comparison > (maybe hexdump "/mnt/junk" vs "/mnt/junk.fsxgood") > Silence is golden
On Sun, Jun 26, 2022 at 09:15:27PM -0700, Darrick J. Wong wrote: > On Wed, Jun 22, 2022 at 05:42:11PM -0700, Darrick J. Wong wrote: > > [resend with shorter 522.out file to keep us under the 300k maximum] > > > > On Thu, Dec 16, 2021 at 09:07:15PM +0000, Matthew Wilcox (Oracle) wrote: > > > Now that iomap has been converted, XFS is large folio safe. > > > Indicate to the VFS that it can now create large folios for XFS. > > > > > > Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> > > > Reviewed-by: Christoph Hellwig <hch@lst.de> > > > Reviewed-by: Darrick J. Wong <djwong@kernel.org> > > > --- > > > fs/xfs/xfs_icache.c | 2 ++ > > > 1 file changed, 2 insertions(+) > > > > > > diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c > > > index da4af2142a2b..cdc39f576ca1 100644 > > > --- a/fs/xfs/xfs_icache.c > > > +++ b/fs/xfs/xfs_icache.c > > > @@ -87,6 +87,7 @@ xfs_inode_alloc( > > > /* VFS doesn't initialise i_mode or i_state! */ > > > VFS_I(ip)->i_mode = 0; > > > VFS_I(ip)->i_state = 0; > > > + mapping_set_large_folios(VFS_I(ip)->i_mapping); > > > > > > XFS_STATS_INC(mp, vn_active); > > > ASSERT(atomic_read(&ip->i_pincount) == 0); > > > @@ -320,6 +321,7 @@ xfs_reinit_inode( > > > inode->i_rdev = dev; > > > inode->i_uid = uid; > > > inode->i_gid = gid; > > > + mapping_set_large_folios(inode->i_mapping); > > > > Hmm. Ever since 5.19-rc1, I've noticed that fsx in generic/522 now > > reports file corruption after 20 minutes of runtime. The corruption is > > surprisingly reproducible (522.out.bad attached below) in that I ran it > > three times and always got the same bad offset (0x6e000) and always the > > same opcode (6213798(166 mod 256) MAPREAD). > > > > I turned off multipage folios and now 522 has run for over an hour > > without problems, so before I go do more debugging, does this ring a > > bell to anyone? > > I tried bisecting, but that didn't yield anything productive and > 5.19-rc4 still fails after 25 minutes; however, it seems that g/522 will > run without problems for at least 3-4 days after reverting this patch > from -rc3. > > So I guess I have a blunt force fix if we can't figure this one out > before 5.19 final, but I'd really rather not. Will keep trying this > week. I'm on holiday for the next week, so I'm not going to be able to spend any time on this until then. I have a suspicion that this may be the same bug Zorro is seeing here: https://lore.kernel.org/linux-mm/20220613010850.6kmpenitmuct2osb@zlang-mailbox/ At least I hope it is, and finding a folio that has been freed would explain (apparent) file corruption.
On Sun, Jun 26, 2022 at 09:15:27PM -0700, Darrick J. Wong wrote: > On Wed, Jun 22, 2022 at 05:42:11PM -0700, Darrick J. Wong wrote: > > [resend with shorter 522.out file to keep us under the 300k maximum] > > > > On Thu, Dec 16, 2021 at 09:07:15PM +0000, Matthew Wilcox (Oracle) wrote: > > > Now that iomap has been converted, XFS is large folio safe. > > > Indicate to the VFS that it can now create large folios for XFS. > > > > > > Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> > > > Reviewed-by: Christoph Hellwig <hch@lst.de> > > > Reviewed-by: Darrick J. Wong <djwong@kernel.org> > > > --- > > > fs/xfs/xfs_icache.c | 2 ++ > > > 1 file changed, 2 insertions(+) > > > > > > diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c > > > index da4af2142a2b..cdc39f576ca1 100644 > > > --- a/fs/xfs/xfs_icache.c > > > +++ b/fs/xfs/xfs_icache.c > > > @@ -87,6 +87,7 @@ xfs_inode_alloc( > > > /* VFS doesn't initialise i_mode or i_state! */ > > > VFS_I(ip)->i_mode = 0; > > > VFS_I(ip)->i_state = 0; > > > + mapping_set_large_folios(VFS_I(ip)->i_mapping); > > > > > > XFS_STATS_INC(mp, vn_active); > > > ASSERT(atomic_read(&ip->i_pincount) == 0); > > > @@ -320,6 +321,7 @@ xfs_reinit_inode( > > > inode->i_rdev = dev; > > > inode->i_uid = uid; > > > inode->i_gid = gid; > > > + mapping_set_large_folios(inode->i_mapping); > > > > Hmm. Ever since 5.19-rc1, I've noticed that fsx in generic/522 now > > reports file corruption after 20 minutes of runtime. The corruption is > > surprisingly reproducible (522.out.bad attached below) in that I ran it > > three times and always got the same bad offset (0x6e000) and always the > > same opcode (6213798(166 mod 256) MAPREAD). > > > > I turned off multipage folios and now 522 has run for over an hour > > without problems, so before I go do more debugging, does this ring a > > bell to anyone? > > I tried bisecting, but that didn't yield anything productive and > 5.19-rc4 still fails after 25 minutes; however, it seems that g/522 will > run without problems for at least 3-4 days after reverting this patch > from -rc3. Took 63 million ops and just over 3 hours before it failed here with a similar 16 byte map read corruption on the first 16 bytes of a page. Given the number of fallocate operations that lead up to the failure - 14 of last 23, plus 3 clone, 2 copy, 2 map read, 1 skip and the map write that it suggests the stale data came from - this smells of an invalidation issue... Cheers, Dave.
On Mon, Jun 27, 2022 at 03:10:40PM +0100, Matthew Wilcox wrote: > On Sun, Jun 26, 2022 at 09:15:27PM -0700, Darrick J. Wong wrote: > > On Wed, Jun 22, 2022 at 05:42:11PM -0700, Darrick J. Wong wrote: > > > [resend with shorter 522.out file to keep us under the 300k maximum] > > > > > > On Thu, Dec 16, 2021 at 09:07:15PM +0000, Matthew Wilcox (Oracle) wrote: > > > > Now that iomap has been converted, XFS is large folio safe. > > > > Indicate to the VFS that it can now create large folios for XFS. > > > > > > > > Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> > > > > Reviewed-by: Christoph Hellwig <hch@lst.de> > > > > Reviewed-by: Darrick J. Wong <djwong@kernel.org> > > > > --- > > > > fs/xfs/xfs_icache.c | 2 ++ > > > > 1 file changed, 2 insertions(+) > > > > > > > > diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c > > > > index da4af2142a2b..cdc39f576ca1 100644 > > > > --- a/fs/xfs/xfs_icache.c > > > > +++ b/fs/xfs/xfs_icache.c > > > > @@ -87,6 +87,7 @@ xfs_inode_alloc( > > > > /* VFS doesn't initialise i_mode or i_state! */ > > > > VFS_I(ip)->i_mode = 0; > > > > VFS_I(ip)->i_state = 0; > > > > + mapping_set_large_folios(VFS_I(ip)->i_mapping); > > > > > > > > XFS_STATS_INC(mp, vn_active); > > > > ASSERT(atomic_read(&ip->i_pincount) == 0); > > > > @@ -320,6 +321,7 @@ xfs_reinit_inode( > > > > inode->i_rdev = dev; > > > > inode->i_uid = uid; > > > > inode->i_gid = gid; > > > > + mapping_set_large_folios(inode->i_mapping); > > > > > > Hmm. Ever since 5.19-rc1, I've noticed that fsx in generic/522 now > > > reports file corruption after 20 minutes of runtime. The corruption is > > > surprisingly reproducible (522.out.bad attached below) in that I ran it > > > three times and always got the same bad offset (0x6e000) and always the > > > same opcode (6213798(166 mod 256) MAPREAD). > > > > > > I turned off multipage folios and now 522 has run for over an hour > > > without problems, so before I go do more debugging, does this ring a > > > bell to anyone? > > > > I tried bisecting, but that didn't yield anything productive and > > 5.19-rc4 still fails after 25 minutes; however, it seems that g/522 will > > run without problems for at least 3-4 days after reverting this patch > > from -rc3. > > > > So I guess I have a blunt force fix if we can't figure this one out > > before 5.19 final, but I'd really rather not. Will keep trying this > > week. > > I'm on holiday for the next week, so I'm not going to be able to spend > any time on this until then. I have a suspicion that this may be the > same bug Zorro is seeing here: > > https://lore.kernel.org/linux-mm/20220613010850.6kmpenitmuct2osb@zlang-mailbox/ > > At least I hope it is, and finding a folio that has been freed would > explain (apparent) file corruption. Hm. I suppose it /could/ be a lost folio getting into the works somewhere. Today I remembered fsx -X, which makes this reproduce a bit faster (~3-8 minutes instead of 20-25). That has helped me to narrow things down a little more: - Turning off INSERT/COLLAPSE_RANGE doesn't make the problem go away, but it does make reading the fsx log much easier. - Turning off clone/dedupe (either via -J -B or formatting with -m reflink=0) makes the problem go away completely. If you define letting fsx run for 90 minutes as "completely". - Neutering vfs_dedupe_file_range_compare by replacing it with an -EBADE return doesn't affect the reproducibility, so it's not the comparison function misbehaving. - I modified fsx.c so that when there's file corruption, it'll report both the first 16 bytes of corruption as well as every corruption that happens on a page boundary. - I also modified run_fsx() to diff the good and junk files, and complain about any corruption happening on a page boundary. Now I see things like this: 2153984( 0 mod 256): SKIPPED (no operation) 2153985( 1 mod 256): DEDUPE 0xf000 thru 0x23fff (0x15000 bytes) to 0x2a000 thru 0x3efff ******BBBB 2153986( 2 mod 256): COPY 0xe794 thru 0x2ae41 (0x1c6ae bytes) to 0x60ac4 thru 0x7d171 2153987( 3 mod 256): TRUNCATE DOWN from 0x7d172 to 0x535da 2153988( 4 mod 256): SKIPPED (no operation) 2153989( 5 mod 256): MAPREAD 0x40b93 thru 0x535d9 (0x12a47 bytes) 2153990( 6 mod 256): COPY 0x5edd thru 0x20282 (0x1a3a6 bytes) to 0x3a9aa thru 0x54d4f 2153991( 7 mod 256): SKIPPED (no operation) 2153992( 8 mod 256): SKIPPED (no operation) 2153993( 9 mod 256): ZERO 0x542d3 thru 0x67006 (0x12d34 bytes) 2153994( 10 mod 256): COPY 0x42cf6 thru 0x538a7 (0x10bb2 bytes) to 0x23fe7 thru 0x34b98 ******EEEE 2153995( 11 mod 256): MAPWRITE 0x5a1fc thru 0x6b067 (0x10e6c bytes) 2153996( 12 mod 256): SKIPPED (no operation) 2153997( 13 mod 256): CLONE 0x38000 thru 0x38fff (0x1000 bytes) to 0x77000 thru 0x77fff 2153998( 14 mod 256): FALLOC 0x49bdd thru 0x62a55 (0x18e78 bytes) INTERIOR 2153999( 15 mod 256): CLONE 0xf000 thru 0x1bfff (0xd000 bytes) to 0x2c000 thru 0x38fff ******JJJJ Log of operations saved to "/mnt/junk.fsxops"; replay with --replay-ops Correct content saved for comparison (maybe hexdump "/mnt/junk" vs "/mnt/junk.fsxgood") junk file 177 -02e000 ec 20 ec 5a ec 78 ec b2 ec e6 ec 1e ec 43 ec 0f -02f000 ec 30 ec 32 ec 4c ec ac ec 5c ec d2 ec 62 ec d3 -030000 ec 73 ec ce ec 8c ec cb ec 94 ec 59 ec 81 ec 34 +02e000 77 db f1 db ba db 01 db d5 db 9c db 4d db de db +02f000 b3 d8 35 d8 e2 d8 bb d8 a4 d8 c8 d8 5b d8 83 d8 +030000 23 d8 c8 d8 22 d8 da d8 97 d8 e0 d8 7e d8 61 d8 When I remount the test filesystem, I see further corruption: $ diff -Naurp <(od -tx1 -Ax -c $TEST_DIR/junk.fsxgood) <(od -tx1 -Ax -c $TEST_DIR/junk) | grep '^[+-]0..000' -011000 ec 20 ec 5a ec 78 ec b2 ec e6 ec 1e ec 43 ec 0f -012000 ec 30 ec 32 ec 4c ec ac ec 5c ec d2 ec 62 ec d3 -013000 ec 73 ec ce ec 8c ec cb ec 94 ec 59 ec 81 ec 34 +011000 77 db f1 db ba db 01 db d5 db 9c db 4d db de db +012000 b3 d8 35 d8 e2 d8 bb d8 a4 d8 c8 d8 5b d8 83 d8 +013000 23 d8 c8 d8 22 d8 da d8 97 d8 e0 d8 7e d8 61 d8 -02e000 ec 20 ec 5a ec 78 ec b2 ec e6 ec 1e ec 43 ec 0f -02f000 ec 30 ec 32 ec 4c ec ac ec 5c ec d2 ec 62 ec d3 -030000 ec 73 ec ce ec 8c ec cb ec 94 ec 59 ec 81 ec 34 +02e000 77 db f1 db ba db 01 db d5 db 9c db 4d db de db +02f000 b3 d8 35 d8 e2 d8 bb d8 a4 d8 c8 d8 5b d8 83 d8 +030000 23 d8 c8 d8 22 d8 da d8 97 d8 e0 d8 7e d8 61 d8 This is really quite strange! The same corruption patterns we saw at pages 0x2e - 0x30 now appear at 0x11-0x13 after the remount! By comparison, the junk.fsxgood file only contains this 77/db/f1 sequence at: $ od -tx1 -Ax -c $TEST_DIR/junk.fsxgood | grep '77 db f1' 008530 db 34 db 77 db f1 db ba db 01 db d5 db 9c db 4d 03d000 77 db f1 db ba db 01 db d5 db 9c db 4d db de db Curiously, the same byte trios at 0x2f000 and 0x30000 have similar repetitions at similar looking offsets: $ od -tx1 -Ax -c $TEST_DIR/junk.fsxgood | grep 'b3 d8 35' 009530 d8 bb d8 b3 d8 35 d8 e2 d8 bb d8 a4 d8 c8 d8 5b 03e000 b3 d8 35 d8 e2 d8 bb d8 a4 d8 c8 d8 5b d8 83 d8 $ od -tx1 -Ax -c $TEST_DIR/junk.fsxgood | grep '23 d8 c8' 00a530 d8 f5 d8 23 d8 c8 d8 22 d8 da d8 97 d8 e0 d8 7e 03f000 23 d8 c8 d8 22 d8 da d8 97 d8 e0 d8 7e d8 61 d8 Though the only pattern that happens consistently is that garbage bytes end up at the reflink dest, and later at the reflink source. I never see any VM_BUG_ON_FOLIO asserts, nor does KASAN report anything. I also added a debug function to dump the folios it finds in the pagecache for the fsx junk file, but nothing looks odd: 522-5099 [001] 491.954659: console: [U] FSX FAILURE xfs_io-5125 [002] 491.961232: console: XFS (sda): EXPERIMENTAL online scrub feature in use. Use at your own risk! xfs_io-5125 [002] 491.961238: bprint: filemap_dump: ino 0xb1 pos 0x0 pfn 0x515cc order 0 xfs_io-5125 [002] 491.961238: bprint: filemap_dump: ino 0xb1 pos 0x1000 pfn 0x515cd order 0 xfs_io-5125 [002] 491.961239: bprint: filemap_dump: ino 0xb1 pos 0x2000 pfn 0x515ce order 0 xfs_io-5125 [002] 491.961239: bprint: filemap_dump: ino 0xb1 pos 0x3000 pfn 0x515cf order 0 xfs_io-5125 [002] 491.961239: bprint: filemap_dump: ino 0xb1 pos 0x4000 pfn 0x50c48 order 0 xfs_io-5125 [002] 491.961240: bprint: filemap_dump: ino 0xb1 pos 0x5000 pfn 0x50c49 order 0 xfs_io-5125 [002] 491.961240: bprint: filemap_dump: ino 0xb1 pos 0x6000 pfn 0x50c4a order 0 xfs_io-5125 [002] 491.961241: bprint: filemap_dump: ino 0xb1 pos 0x7000 pfn 0xc8a8 order 0 xfs_io-5125 [002] 491.961241: bprint: filemap_dump: ino 0xb1 pos 0x8000 pfn 0x50988 order 2 xfs_io-5125 [002] 491.961241: bprint: filemap_dump: ino 0xb1 pos 0xc000 pfn 0x509e0 order 2 xfs_io-5125 [002] 491.961242: bprint: filemap_dump: ino 0xb1 pos 0x10000 pfn 0x4db64 order 2 xfs_io-5125 [002] 491.961242: bprint: filemap_dump: ino 0xb1 pos 0x14000 pfn 0x50c4c order 0 xfs_io-5125 [002] 491.961243: bprint: filemap_dump: ino 0xb1 pos 0x15000 pfn 0x12485 order 0 xfs_io-5125 [002] 491.961243: bprint: filemap_dump: ino 0xb1 pos 0x16000 pfn 0x50c4d order 0 xfs_io-5125 [002] 491.961243: bprint: filemap_dump: ino 0xb1 pos 0x17000 pfn 0x50c4e order 0 xfs_io-5125 [002] 491.961244: bprint: filemap_dump: ino 0xb1 pos 0x18000 pfn 0x4eef8 order 2 xfs_io-5125 [002] 491.961244: bprint: filemap_dump: ino 0xb1 pos 0x1c000 pfn 0x4eefc order 2 xfs_io-5125 [002] 491.961245: bprint: filemap_dump: ino 0xb1 pos 0x20000 pfn 0x4eef0 order 2 xfs_io-5125 [002] 491.961245: bprint: filemap_dump: ino 0xb1 pos 0x24000 pfn 0x50f2c order 2 xfs_io-5125 [002] 491.961245: bprint: filemap_dump: ino 0xb1 pos 0x28000 pfn 0x50f28 order 2 xfs_io-5125 [002] 491.961246: bprint: filemap_dump: ino 0xb1 pos 0x2c000 pfn 0x50f24 order 2 xfs_io-5125 [002] 491.961246: bprint: filemap_dump: ino 0xb1 pos 0x30000 pfn 0x50f20 order 2 I'll keep digging... --D
On Mon, Jun 27, 2022 at 03:16:19PM -0700, Darrick J. Wong wrote: > On Mon, Jun 27, 2022 at 03:10:40PM +0100, Matthew Wilcox wrote: > > On Sun, Jun 26, 2022 at 09:15:27PM -0700, Darrick J. Wong wrote: > > > On Wed, Jun 22, 2022 at 05:42:11PM -0700, Darrick J. Wong wrote: > > > > [resend with shorter 522.out file to keep us under the 300k maximum] > > > > > > > > On Thu, Dec 16, 2021 at 09:07:15PM +0000, Matthew Wilcox (Oracle) wrote: > > > > > Now that iomap has been converted, XFS is large folio safe. > > > > > Indicate to the VFS that it can now create large folios for XFS. > > > > > > > > > > Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> > > > > > Reviewed-by: Christoph Hellwig <hch@lst.de> > > > > > Reviewed-by: Darrick J. Wong <djwong@kernel.org> > > > > > --- > > > > > fs/xfs/xfs_icache.c | 2 ++ > > > > > 1 file changed, 2 insertions(+) > > > > > > > > > > diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c > > > > > index da4af2142a2b..cdc39f576ca1 100644 > > > > > --- a/fs/xfs/xfs_icache.c > > > > > +++ b/fs/xfs/xfs_icache.c > > > > > @@ -87,6 +87,7 @@ xfs_inode_alloc( > > > > > /* VFS doesn't initialise i_mode or i_state! */ > > > > > VFS_I(ip)->i_mode = 0; > > > > > VFS_I(ip)->i_state = 0; > > > > > + mapping_set_large_folios(VFS_I(ip)->i_mapping); > > > > > > > > > > XFS_STATS_INC(mp, vn_active); > > > > > ASSERT(atomic_read(&ip->i_pincount) == 0); > > > > > @@ -320,6 +321,7 @@ xfs_reinit_inode( > > > > > inode->i_rdev = dev; > > > > > inode->i_uid = uid; > > > > > inode->i_gid = gid; > > > > > + mapping_set_large_folios(inode->i_mapping); > > > > > > > > Hmm. Ever since 5.19-rc1, I've noticed that fsx in generic/522 now > > > > reports file corruption after 20 minutes of runtime. The corruption is > > > > surprisingly reproducible (522.out.bad attached below) in that I ran it > > > > three times and always got the same bad offset (0x6e000) and always the > > > > same opcode (6213798(166 mod 256) MAPREAD). > > > > > > > > I turned off multipage folios and now 522 has run for over an hour > > > > without problems, so before I go do more debugging, does this ring a > > > > bell to anyone? > > > > > > I tried bisecting, but that didn't yield anything productive and > > > 5.19-rc4 still fails after 25 minutes; however, it seems that g/522 will > > > run without problems for at least 3-4 days after reverting this patch > > > from -rc3. > > > > > > So I guess I have a blunt force fix if we can't figure this one out > > > before 5.19 final, but I'd really rather not. Will keep trying this > > > week. > > > > I'm on holiday for the next week, so I'm not going to be able to spend > > any time on this until then. I have a suspicion that this may be the > > same bug Zorro is seeing here: > > > > https://lore.kernel.org/linux-mm/20220613010850.6kmpenitmuct2osb@zlang-mailbox/ > > > > At least I hope it is, and finding a folio that has been freed would > > explain (apparent) file corruption. > > Hm. I suppose it /could/ be a lost folio getting into the works > somewhere. > > Today I remembered fsx -X, which makes this reproduce a bit faster (~3-8 > minutes instead of 20-25). That has helped me to narrow things down a > little more: > > - Turning off INSERT/COLLAPSE_RANGE doesn't make the problem go away, > but it does make reading the fsx log much easier. > > - Turning off clone/dedupe (either via -J -B or formatting with -m > reflink=0) makes the problem go away completely. If you define > letting fsx run for 90 minutes as "completely". > > - Neutering vfs_dedupe_file_range_compare by replacing it with an -EBADE > return doesn't affect the reproducibility, so it's not the comparison > function misbehaving. > - I modified fsx.c so that when there's file corruption, it'll report > both the first 16 bytes of corruption as well as every corruption that > happens on a page boundary. > > - I also modified run_fsx() to diff the good and junk files, and > complain about any corruption happening on a page boundary. Now I see > things like this: > > 2153984( 0 mod 256): SKIPPED (no operation) > 2153985( 1 mod 256): DEDUPE 0xf000 thru 0x23fff (0x15000 bytes) to 0x2a000 thru 0x3efff ******BBBB > 2153986( 2 mod 256): COPY 0xe794 thru 0x2ae41 (0x1c6ae bytes) to 0x60ac4 thru 0x7d171 > 2153987( 3 mod 256): TRUNCATE DOWN from 0x7d172 to 0x535da > 2153988( 4 mod 256): SKIPPED (no operation) > 2153989( 5 mod 256): MAPREAD 0x40b93 thru 0x535d9 (0x12a47 bytes) > 2153990( 6 mod 256): COPY 0x5edd thru 0x20282 (0x1a3a6 bytes) to 0x3a9aa thru 0x54d4f > 2153991( 7 mod 256): SKIPPED (no operation) > 2153992( 8 mod 256): SKIPPED (no operation) > 2153993( 9 mod 256): ZERO 0x542d3 thru 0x67006 (0x12d34 bytes) > 2153994( 10 mod 256): COPY 0x42cf6 thru 0x538a7 (0x10bb2 bytes) to 0x23fe7 thru 0x34b98 ******EEEE > 2153995( 11 mod 256): MAPWRITE 0x5a1fc thru 0x6b067 (0x10e6c bytes) > 2153996( 12 mod 256): SKIPPED (no operation) > 2153997( 13 mod 256): CLONE 0x38000 thru 0x38fff (0x1000 bytes) to 0x77000 thru 0x77fff > 2153998( 14 mod 256): FALLOC 0x49bdd thru 0x62a55 (0x18e78 bytes) INTERIOR > 2153999( 15 mod 256): CLONE 0xf000 thru 0x1bfff (0xd000 bytes) to 0x2c000 thru 0x38fff ******JJJJ > Log of operations saved to "/mnt/junk.fsxops"; replay with --replay-ops > Correct content saved for comparison > (maybe hexdump "/mnt/junk" vs "/mnt/junk.fsxgood") > junk file 177 > -02e000 ec 20 ec 5a ec 78 ec b2 ec e6 ec 1e ec 43 ec 0f > -02f000 ec 30 ec 32 ec 4c ec ac ec 5c ec d2 ec 62 ec d3 > -030000 ec 73 ec ce ec 8c ec cb ec 94 ec 59 ec 81 ec 34 > +02e000 77 db f1 db ba db 01 db d5 db 9c db 4d db de db > +02f000 b3 d8 35 d8 e2 d8 bb d8 a4 d8 c8 d8 5b d8 83 d8 > +030000 23 d8 c8 d8 22 d8 da d8 97 d8 e0 d8 7e d8 61 d8 > > When I remount the test filesystem, I see further corruption: > > $ diff -Naurp <(od -tx1 -Ax -c $TEST_DIR/junk.fsxgood) <(od -tx1 -Ax -c $TEST_DIR/junk) | grep '^[+-]0..000' > -011000 ec 20 ec 5a ec 78 ec b2 ec e6 ec 1e ec 43 ec 0f > -012000 ec 30 ec 32 ec 4c ec ac ec 5c ec d2 ec 62 ec d3 > -013000 ec 73 ec ce ec 8c ec cb ec 94 ec 59 ec 81 ec 34 > +011000 77 db f1 db ba db 01 db d5 db 9c db 4d db de db > +012000 b3 d8 35 d8 e2 d8 bb d8 a4 d8 c8 d8 5b d8 83 d8 > +013000 23 d8 c8 d8 22 d8 da d8 97 d8 e0 d8 7e d8 61 d8 > -02e000 ec 20 ec 5a ec 78 ec b2 ec e6 ec 1e ec 43 ec 0f > -02f000 ec 30 ec 32 ec 4c ec ac ec 5c ec d2 ec 62 ec d3 > -030000 ec 73 ec ce ec 8c ec cb ec 94 ec 59 ec 81 ec 34 > +02e000 77 db f1 db ba db 01 db d5 db 9c db 4d db de db > +02f000 b3 d8 35 d8 e2 d8 bb d8 a4 d8 c8 d8 5b d8 83 d8 > +030000 23 d8 c8 d8 22 d8 da d8 97 d8 e0 d8 7e d8 61 d8 > > This is really quite strange! The same corruption patterns we saw at > pages 0x2e - 0x30 now appear at 0x11-0x13 after the remount! Hmmmm - look at what the last operation before failure does - it clones 0xf000-0x1bfff to 0x2c000-0x38fff. IOWs, those ranges *should* be identical and the the corruption is actually occuring at 0x11000-0x13fff. It's not until that range gets cloned to 0x2e000-0x30fff that the corruption is detected. So we're looking in the wrong spot for the page cache corruption - we need to be looking at operations over the range 0x11000-0x13fff for misbehaviour, not where fsx detected the corrupt data. > By comparison, the junk.fsxgood file only contains this 77/db/f1 > sequence at: > > $ od -tx1 -Ax -c $TEST_DIR/junk.fsxgood | grep '77 db f1' > 008530 db 34 db 77 db f1 db ba db 01 db d5 db 9c db 4d > 03d000 77 db f1 db ba db 01 db d5 db 9c db 4d db de db > > Curiously, the same byte trios at 0x2f000 and 0x30000 have similar > repetitions at similar looking offsets: > > $ od -tx1 -Ax -c $TEST_DIR/junk.fsxgood | grep 'b3 d8 35' > 009530 d8 bb d8 b3 d8 35 d8 e2 d8 bb d8 a4 d8 c8 d8 5b > 03e000 b3 d8 35 d8 e2 d8 bb d8 a4 d8 c8 d8 5b d8 83 d8 > $ od -tx1 -Ax -c $TEST_DIR/junk.fsxgood | grep '23 d8 c8' > 00a530 d8 f5 d8 23 d8 c8 d8 22 d8 da d8 97 d8 e0 d8 7e > 03f000 23 d8 c8 d8 22 d8 da d8 97 d8 e0 d8 7e d8 61 d8 > > Though the only pattern that happens consistently is that garbage bytes > end up at the reflink dest, and later at the reflink source. I never > see any VM_BUG_ON_FOLIO asserts, nor does KASAN report anything. Smells like the page cache over the clone source is not getting marked dirty and/or flushed to disk correctly before the clone is run. It then shares the extent with stale data to the new location (the destination) which then fails the contents validation. Do we have a case where we are only writing back the head page of the multipage folio? > I also added a debug function to dump the folios it finds in the > pagecache for the fsx junk file, but nothing looks odd: > > 522-5099 [001] 491.954659: console: [U] FSX FAILURE > xfs_io-5125 [002] 491.961232: console: XFS (sda): EXPERIMENTAL online scrub feature in use. Use at your own risk! > xfs_io-5125 [002] 491.961238: bprint: filemap_dump: ino 0xb1 pos 0x0 pfn 0x515cc order 0 > xfs_io-5125 [002] 491.961238: bprint: filemap_dump: ino 0xb1 pos 0x1000 pfn 0x515cd order 0 > xfs_io-5125 [002] 491.961239: bprint: filemap_dump: ino 0xb1 pos 0x2000 pfn 0x515ce order 0 > xfs_io-5125 [002] 491.961239: bprint: filemap_dump: ino 0xb1 pos 0x3000 pfn 0x515cf order 0 > xfs_io-5125 [002] 491.961239: bprint: filemap_dump: ino 0xb1 pos 0x4000 pfn 0x50c48 order 0 > xfs_io-5125 [002] 491.961240: bprint: filemap_dump: ino 0xb1 pos 0x5000 pfn 0x50c49 order 0 > xfs_io-5125 [002] 491.961240: bprint: filemap_dump: ino 0xb1 pos 0x6000 pfn 0x50c4a order 0 > xfs_io-5125 [002] 491.961241: bprint: filemap_dump: ino 0xb1 pos 0x7000 pfn 0xc8a8 order 0 > xfs_io-5125 [002] 491.961241: bprint: filemap_dump: ino 0xb1 pos 0x8000 pfn 0x50988 order 2 > xfs_io-5125 [002] 491.961241: bprint: filemap_dump: ino 0xb1 pos 0xc000 pfn 0x509e0 order 2 > xfs_io-5125 [002] 491.961242: bprint: filemap_dump: ino 0xb1 pos 0x10000 pfn 0x4db64 order 2 So this is the folio that likely has the problem (the source)... > xfs_io-5125 [002] 491.961242: bprint: filemap_dump: ino 0xb1 pos 0x14000 pfn 0x50c4c order 0 > xfs_io-5125 [002] 491.961243: bprint: filemap_dump: ino 0xb1 pos 0x15000 pfn 0x12485 order 0 > xfs_io-5125 [002] 491.961243: bprint: filemap_dump: ino 0xb1 pos 0x16000 pfn 0x50c4d order 0 > xfs_io-5125 [002] 491.961243: bprint: filemap_dump: ino 0xb1 pos 0x17000 pfn 0x50c4e order 0 > xfs_io-5125 [002] 491.961244: bprint: filemap_dump: ino 0xb1 pos 0x18000 pfn 0x4eef8 order 2 > xfs_io-5125 [002] 491.961244: bprint: filemap_dump: ino 0xb1 pos 0x1c000 pfn 0x4eefc order 2 > xfs_io-5125 [002] 491.961245: bprint: filemap_dump: ino 0xb1 pos 0x20000 pfn 0x4eef0 order 2 > xfs_io-5125 [002] 491.961245: bprint: filemap_dump: ino 0xb1 pos 0x24000 pfn 0x50f2c order 2 > xfs_io-5125 [002] 491.961245: bprint: filemap_dump: ino 0xb1 pos 0x28000 pfn 0x50f28 order 2 > xfs_io-5125 [002] 491.961246: bprint: filemap_dump: ino 0xb1 pos 0x2c000 pfn 0x50f24 order 2 ... not the one at the destination here. Cheers, Dave.
[cc linux-mm@kvack.org] On Mon, Jun 27, 2022 at 03:16:19PM -0700, Darrick J. Wong wrote: > On Mon, Jun 27, 2022 at 03:10:40PM +0100, Matthew Wilcox wrote: > > On Sun, Jun 26, 2022 at 09:15:27PM -0700, Darrick J. Wong wrote: > > > On Wed, Jun 22, 2022 at 05:42:11PM -0700, Darrick J. Wong wrote: > > > > [resend with shorter 522.out file to keep us under the 300k maximum] > > > > > > > > On Thu, Dec 16, 2021 at 09:07:15PM +0000, Matthew Wilcox (Oracle) wrote: > > > > > Now that iomap has been converted, XFS is large folio safe. > > > > > Indicate to the VFS that it can now create large folios for XFS. > > > > > > > > > > Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> > > > > > Reviewed-by: Christoph Hellwig <hch@lst.de> > > > > > Reviewed-by: Darrick J. Wong <djwong@kernel.org> > > > > > --- > > > > > fs/xfs/xfs_icache.c | 2 ++ > > > > > 1 file changed, 2 insertions(+) > > > > > > > > > > diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c > > > > > index da4af2142a2b..cdc39f576ca1 100644 > > > > > --- a/fs/xfs/xfs_icache.c > > > > > +++ b/fs/xfs/xfs_icache.c > > > > > @@ -87,6 +87,7 @@ xfs_inode_alloc( > > > > > /* VFS doesn't initialise i_mode or i_state! */ > > > > > VFS_I(ip)->i_mode = 0; > > > > > VFS_I(ip)->i_state = 0; > > > > > + mapping_set_large_folios(VFS_I(ip)->i_mapping); > > > > > > > > > > XFS_STATS_INC(mp, vn_active); > > > > > ASSERT(atomic_read(&ip->i_pincount) == 0); > > > > > @@ -320,6 +321,7 @@ xfs_reinit_inode( > > > > > inode->i_rdev = dev; > > > > > inode->i_uid = uid; > > > > > inode->i_gid = gid; > > > > > + mapping_set_large_folios(inode->i_mapping); > > > > > > > > Hmm. Ever since 5.19-rc1, I've noticed that fsx in generic/522 now > > > > reports file corruption after 20 minutes of runtime. The corruption is > > > > surprisingly reproducible (522.out.bad attached below) in that I ran it > > > > three times and always got the same bad offset (0x6e000) and always the > > > > same opcode (6213798(166 mod 256) MAPREAD). > > > > > > > > I turned off multipage folios and now 522 has run for over an hour > > > > without problems, so before I go do more debugging, does this ring a > > > > bell to anyone? > > > > > > I tried bisecting, but that didn't yield anything productive and > > > 5.19-rc4 still fails after 25 minutes; however, it seems that g/522 will > > > run without problems for at least 3-4 days after reverting this patch > > > from -rc3. > > > > > > So I guess I have a blunt force fix if we can't figure this one out > > > before 5.19 final, but I'd really rather not. Will keep trying this > > > week. > > > > I'm on holiday for the next week, so I'm not going to be able to spend > > any time on this until then. I have a suspicion that this may be the > > same bug Zorro is seeing here: > > > > https://lore.kernel.org/linux-mm/20220613010850.6kmpenitmuct2osb@zlang-mailbox/ > > > > At least I hope it is, and finding a folio that has been freed would > > explain (apparent) file corruption. > > Hm. I suppose it /could/ be a lost folio getting into the works > somewhere. > > Today I remembered fsx -X, which makes this reproduce a bit faster (~3-8 > minutes instead of 20-25). That has helped me to narrow things down a > little more: > > - Turning off INSERT/COLLAPSE_RANGE doesn't make the problem go away, > but it does make reading the fsx log much easier. > > - Turning off clone/dedupe (either via -J -B or formatting with -m > reflink=0) makes the problem go away completely. If you define > letting fsx run for 90 minutes as "completely". So using this technique, I've discovered that there's a dirty page accounting leak that eventually results in fsx hanging in balance_dirty_pages(). [15300.670773] sysrq: Show Blocked State [15300.672712] task:fsx state:D stack:11600 pid: 5607 ppid: 5601 flags:0x00004004 [15300.676785] Call Trace: [15300.678061] <TASK> [15300.679171] __schedule+0x310/0x9f0 [15300.681096] schedule+0x4b/0xb0 [15300.683385] schedule_timeout+0x88/0x160 [15300.685407] ? do_raw_spin_unlock+0x4b/0xa0 [15300.687567] ? timer_migration_handler+0x90/0x90 [15300.690277] io_schedule_timeout+0x4c/0x70 [15300.692346] balance_dirty_pages_ratelimited+0x259/0xb60 [15300.695205] fault_dirty_shared_page+0xef/0x100 [15300.697650] do_wp_page+0x414/0x760 [15300.699819] __handle_mm_fault+0xc59/0x1730 [15300.702226] ? do_mmap+0x348/0x540 [15300.704418] handle_mm_fault+0x7a/0x1c0 [15300.706658] exc_page_fault+0x1da/0x780 [15300.709020] asm_exc_page_fault+0x27/0x30 # cat /proc/meminfo MemTotal: 16292348 kB MemFree: 15298308 kB MemAvailable: 15219180 kB Buffers: 71336 kB Cached: 209288 kB SwapCached: 0 kB Active: 141840 kB Inactive: 254748 kB Active(anon): 572 kB Inactive(anon): 128000 kB Active(file): 141268 kB Inactive(file): 126748 kB Unevictable: 12344 kB Mlocked: 12344 kB SwapTotal: 497976 kB SwapFree: 497976 kB Dirty: 2825676 kB <<<<<<<<< Writeback: 0 kB AnonPages: 125072 kB Mapped: 65592 kB Shmem: 2128 kB KReclaimable: 49424 kB Slab: 110992 kB SReclaimable: 49424 kB SUnreclaim: 61568 kB KernelStack: 4720 kB PageTables: 4000 kB NFS_Unstable: 0 kB Bounce: 0 kB WritebackTmp: 0 kB CommitLimit: 8644148 kB Committed_AS: 277864 kB VmallocTotal: 34359738367 kB VmallocUsed: 46604 kB VmallocChunk: 0 kB Percpu: 5376 kB AnonHugePages: 30720 kB ShmemHugePages: 0 kB ShmemPmdMapped: 0 kB FileHugePages: 0 kB FilePmdMapped: 0 kB HugePages_Total: 0 HugePages_Free: 0 HugePages_Rsvd: 0 HugePages_Surp: 0 Hugepagesize: 2048 kB Hugetlb: 0 kB DirectMap4k: 61276 kB DirectMap2M: 6230016 kB DirectMap1G: 29360128 kB As you can see, the system only has 1GB of 16GB of memory used, but it's got 2.8GB of dirty pages accounted and so any write that is done will now hang on the dirty page throttle. sysrq-m shows: [15443.617242] sysrq: Show Memory [15443.618465] Mem-Info: [15443.619428] active_anon:143 inactive_anon:32000 isolated_anon:0 [15443.619428] active_file:35317 inactive_file:31687 isolated_file:0 [15443.619428] unevictable:3086 dirty:706420 writeback:0 [15443.619428] slab_reclaimable:12356 slab_unreclaimable:15392 [15443.619428] mapped:16398 shmem:532 pagetables:1000 bounce:0 [15443.619428] kernel_misc_reclaimable:0 [15443.619428] free:3825037 free_pcp:34496 free_cma:0 [15443.632592] Node 0 active_anon:152kB inactive_anon:17812kB active_file:27304kB inactive_file:19200kB unevictable:832kB isolated(anon ):0kB isolated(file):0kB mapped:8820kB dirty:632kB writeback:0kB shmem:224kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB writebac k_tmp:0kB kernel_stack:1332kB pagetables:812kB all_unreclaimable? no [15443.643111] Node 1 active_anon:168kB inactive_anon:54348kB active_file:52160kB inactive_file:19808kB unevictable:6932kB isolated(ano n):0kB isolated(file):0kB mapped:29512kB dirty:0kB writeback:0kB shmem:596kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 14336kB writ eback_tmp:0kB kernel_stack:836kB pagetables:1212kB all_unreclaimable? no [15443.653719] Node 2 active_anon:152kB inactive_anon:31780kB active_file:32604kB inactive_file:62196kB unevictable:3324kB isolated(ano n):0kB isolated(file):0kB mapped:11560kB dirty:0kB writeback:0kB shmem:216kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 8192kB write back_tmp:0kB kernel_stack:1268kB pagetables:1084kB all_unreclaimable? no [15443.664282] Node 3 active_anon:100kB inactive_anon:24060kB active_file:29200kB inactive_file:25544kB unevictable:1256kB isolated(ano n):0kB isolated(file):0kB mapped:15700kB dirty:2825048kB writeback:0kB shmem:1092kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 10240 kB writeback_tmp:0kB kernel_stack:1268kB pagetables:892kB all_unreclaimable? no [15443.673605] Node 0 DMA free:15360kB boost:0kB min:124kB low:152kB high:180kB reserved_highatomic:0KB active_anon:0kB inactive_anon:0 kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15992kB managed:15360kB mlocked:0kB bounce:0kB free_pcp:0 kB local_pcp:0kB free_cma:0kB [15443.680976] lowmem_reserve[]: 0 2911 7902 7902 7902 [15443.682360] Node 0 DMA32 free:2981444kB boost:0kB min:24744kB low:30928kB high:37112kB reserved_highatomic:0KB active_anon:0kB inact ive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:3129180kB managed:2986368kB mlocked:0kB bounce: 0kB free_pcp:4916kB local_pcp:0kB free_cma:0kB [15443.689280] lowmem_reserve[]: 0 0 4990 4990 4990 [15443.690450] Node 0 Normal free:4968904kB boost:0kB min:42412kB low:53012kB high:63612kB reserved_highatomic:0KB active_anon:152kB in active_anon:17812kB active_file:27304kB inactive_file:19200kB unevictable:832kB writepending:632kB present:5242880kB managed:5110756kB mlocked:832kB bounce:0kB free_pcp:41008kB local_pcp:5944kB free_cma:0kB [15443.697102] lowmem_reserve[]: 0 0 0 0 0 [15443.697986] Node 1 Normal free:5806356kB boost:0kB min:51384kB low:64228kB high:77072kB reserved_highatomic:0KB active_anon:168kB in active_anon:54348kB active_file:52160kB inactive_file:19808kB unevictable:6932kB writepending:0kB present:6291456kB managed:6192112kB m locked:6932kB bounce:0kB free_pcp:68644kB local_pcp:15328kB free_cma:0kB [15443.704154] lowmem_reserve[]: 0 0 0 0 0 [15443.705005] Node 2 Normal free:714616kB boost:0kB min:8556kB low:10692kB high:12828kB reserved_highatomic:0KB active_anon:152kB inac tive_anon:31780kB active_file:32604kB inactive_file:62196kB unevictable:3324kB writepending:0kB present:1048576kB managed:1031152kB mlo cked:3324kB bounce:0kB free_pcp:10904kB local_pcp:0kB free_cma:0kB [15443.712094] lowmem_reserve[]: 0 0 0 0 0 [15443.713270] Node 3 Normal free:813468kB boost:0kB min:7936kB low:9920kB high:11904kB reserved_highatomic:0KB active_anon:100kB inact ive_anon:24060kB active_file:29200kB inactive_file:25544kB unevictable:1256kB writepending:2825212kB present:1048576kB managed:956600kB mlocked:1256kB bounce:0kB free_pcp:12508kB local_pcp:388kB free_cma:0kB [15443.721013] lowmem_reserve[]: 0 0 0 0 0 [15443.721958] Node 0 DMA: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15360kB [15443.725048] Node 0 DMA32: 3*4kB (UM) 1*8kB (M) 1*16kB (M) 1*32kB (M) 2*64kB (M) 1*128kB (M) 3*256kB (UM) 3*512kB (UM) 3*1024kB (UM) 3*2048kB (UM) 725*4096kB (M) = 2981444kB [15443.729512] Node 0 Normal: 1714*4kB (UME) 1504*8kB (UME) 988*16kB (UME) 562*32kB (UME) 268*64kB (UME) 96*128kB (UME) 27*256kB (UME) 5*512kB (UM) 1*1024kB (E) 5*2048kB (UME) 1188*4096kB (M) = 4968904kB [15443.734684] Node 1 Normal: 977*4kB (UM) 1086*8kB (UME) 406*16kB (UME) 210*32kB (UM) 133*64kB (UME) 116*128kB (UME) 37*256kB (UM) 18* 512kB (UM) 12*1024kB (UME) 2*2048kB (UM) 1397*4096kB (M) = 5806356kB [15443.739997] Node 2 Normal: 456*4kB (UME) 307*8kB (UM) 136*16kB (UME) 138*32kB (UME) 44*64kB (UME) 10*128kB (UE) 21*256kB (UM) 12*512 kB (UM) 6*1024kB (UM) 7*2048kB (UME) 163*4096kB (UM) = 714616kB [15443.745261] Node 3 Normal: 463*4kB (UM) 940*8kB (UM) 392*16kB (UME) 210*32kB (UME) 121*64kB (UME) 40*128kB (UME) 2*256kB (ME) 1*512k B (U) 1*1024kB (E) 3*2048kB (UME) 188*4096kB (M) = 813468kB [15443.750265] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB [15443.752776] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB [15443.755614] Node 1 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB [15443.758256] Node 1 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB [15443.761168] Node 2 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB [15443.763996] Node 2 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB [15443.766634] Node 3 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB [15443.769423] Node 3 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB [15443.772050] 70156 total pagecache pages [15443.773364] 0 pages in swap cache [15443.774535] Swap cache stats: add 0, delete 0, find 0/0 [15443.776213] Free swap = 497976kB [15443.777394] Total swap = 497976kB [15443.778606] 4194165 pages RAM [15443.779551] 0 pages HighMem/MovableOnly [15443.780705] 121078 pages reserved It is node 3 that fsx has been running on and it's the only node that has been leaking dirty pages. Node 3 only has 1GB of pages present, so having 2.8GB of dirty pages on the node is just a little wrong. ZONE_NR_WRITE_PENDING, NR_FILE_DIRTY and NR_DIRTY all match in terms of leaking dirty pages. The reproducer for 5.19-rc4 is running fsx from fstests like so: <create 8GB pmem/ramdisk> mkfs.xfs -f /dev/pmem0 mount /dev/pmem0 /mnt/test src/fstests/ltp/fsx q -N 100000000 -p 1000000 -o 128000 -l 600000 -C -I -X /mnt/test/junk And this will leak around 300-400kB of dirty pages every second. I suspect this is the same problem as the data corruption that Darrick is seeing - that smells of a dirty multi-page folio that hasn't been fully written back. e.g. a multipage folio gets marked dirty so accounts mulitple pages dirty, but only a single page of the folio is written and marked clean, leaking both dirty page accounting and stale data contents on disk that fsx eventually trips over. I can't find a likely cause through reading the code, so maybe someone who knows the mm accounting code better than I do can spot the problem.... Cheers, Dave.
On Tue, Jun 28, 2022 at 05:31:20PM +1000, Dave Chinner wrote: > So using this technique, I've discovered that there's a dirty page > accounting leak that eventually results in fsx hanging in > balance_dirty_pages(). Alas, I think this is only an accounting error, and not related to the problem(s) that Darrick & Zorro are seeing. I think what you're seeing is dirty pages being dropped at truncation without the appropriate accounting. ie this should be the fix: +++ b/mm/huge_memory.c @@ -2443,6 +2443,8 @@ static void __split_huge_page(struct page *page, struct list_head *list, __delete_from_page_cache(head + i, NULL); if (shmem_mapping(head->mapping)) shmem_uncharge(head->mapping->host, 1); + else + folio_account_cleaned(page_folio(head + i)); put_page(head + i); } else if (!PageAnon(page)) { __xa_store(&head->mapping->i_pages, head[i].index,
On Tue, Jun 28, 2022 at 12:27:40PM +0100, Matthew Wilcox wrote: > On Tue, Jun 28, 2022 at 05:31:20PM +1000, Dave Chinner wrote: > > So using this technique, I've discovered that there's a dirty page > > accounting leak that eventually results in fsx hanging in > > balance_dirty_pages(). > > Alas, I think this is only an accounting error, and not related to > the problem(s) that Darrick & Zorro are seeing. I think what you're > seeing is dirty pages being dropped at truncation without the > appropriate accounting. ie this should be the fix: Argh, try one that actually compiles. diff --git a/mm/huge_memory.c b/mm/huge_memory.c index f7248002dad9..0553ae90509f 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -18,6 +18,7 @@ #include <linux/shrinker.h> #include <linux/mm_inline.h> #include <linux/swapops.h> +#include <linux/backing-dev.h> #include <linux/dax.h> #include <linux/khugepaged.h> #include <linux/freezer.h> @@ -2443,6 +2444,9 @@ static void __split_huge_page(struct page *page, struct list_head *list, __delete_from_page_cache(head + i, NULL); if (shmem_mapping(head->mapping)) shmem_uncharge(head->mapping->host, 1); + else + folio_account_cleaned(page_folio(head + i), + inode_to_wb(folio->mapping->host)); put_page(head + i); } else if (!PageAnon(page)) { __xa_store(&head->mapping->i_pages, head[i].index,
On Tue, Jun 28, 2022 at 12:31:55PM +0100, Matthew Wilcox wrote: > On Tue, Jun 28, 2022 at 12:27:40PM +0100, Matthew Wilcox wrote: > > On Tue, Jun 28, 2022 at 05:31:20PM +1000, Dave Chinner wrote: > > > So using this technique, I've discovered that there's a dirty page > > > accounting leak that eventually results in fsx hanging in > > > balance_dirty_pages(). > > > > Alas, I think this is only an accounting error, and not related to > > the problem(s) that Darrick & Zorro are seeing. I think what you're > > seeing is dirty pages being dropped at truncation without the > > appropriate accounting. ie this should be the fix: > > Argh, try one that actually compiles. ... that one's going to underflow the accounting. Maybe I shouldn't be writing code at 6am? diff --git a/mm/huge_memory.c b/mm/huge_memory.c index f7248002dad9..4eec6ee83e44 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -18,6 +18,7 @@ #include <linux/shrinker.h> #include <linux/mm_inline.h> #include <linux/swapops.h> +#include <linux/backing-dev.h> #include <linux/dax.h> #include <linux/khugepaged.h> #include <linux/freezer.h> @@ -2439,11 +2440,15 @@ static void __split_huge_page(struct page *page, struct list_head *list, __split_huge_page_tail(head, i, lruvec, list); /* Some pages can be beyond EOF: drop them from page cache */ if (head[i].index >= end) { - ClearPageDirty(head + i); - __delete_from_page_cache(head + i, NULL); + struct folio *tail = page_folio(head + i); + if (shmem_mapping(head->mapping)) shmem_uncharge(head->mapping->host, 1); - put_page(head + i); + else if (folio_test_clear_dirty(tail)) + folio_account_cleaned(tail, + inode_to_wb(folio->mapping->host)); + __filemap_remove_folio(tail, NULL); + folio_put(tail); } else if (!PageAnon(page)) { __xa_store(&head->mapping->i_pages, head[i].index, head + i, 0);
On Tue, Jun 28, 2022 at 02:18:24PM +0100, Matthew Wilcox wrote: > On Tue, Jun 28, 2022 at 12:31:55PM +0100, Matthew Wilcox wrote: > > On Tue, Jun 28, 2022 at 12:27:40PM +0100, Matthew Wilcox wrote: > > > On Tue, Jun 28, 2022 at 05:31:20PM +1000, Dave Chinner wrote: > > > > So using this technique, I've discovered that there's a dirty page > > > > accounting leak that eventually results in fsx hanging in > > > > balance_dirty_pages(). > > > > > > Alas, I think this is only an accounting error, and not related to > > > the problem(s) that Darrick & Zorro are seeing. I think what you're > > > seeing is dirty pages being dropped at truncation without the > > > appropriate accounting. ie this should be the fix: > > > > Argh, try one that actually compiles. > > ... that one's going to underflow the accounting. Maybe I shouldn't > be writing code at 6am? I dunno, it's been running on my test VMs for 160 minutes (same debug setup as yesterday) and 100 minutes (regular g/522, no -C/-I flags to fsx, no debugging junk) and neither have reported corruptions. $ grep Dirty /proc/meminfo Dirty: 100 kB So, pretty good for writing code at 6am while on holiday. I'll let this run overnight, but I think you've fixed the problem, at least for me... --D > diff --git a/mm/huge_memory.c b/mm/huge_memory.c > index f7248002dad9..4eec6ee83e44 100644 > --- a/mm/huge_memory.c > +++ b/mm/huge_memory.c > @@ -18,6 +18,7 @@ > #include <linux/shrinker.h> > #include <linux/mm_inline.h> > #include <linux/swapops.h> > +#include <linux/backing-dev.h> > #include <linux/dax.h> > #include <linux/khugepaged.h> > #include <linux/freezer.h> > @@ -2439,11 +2440,15 @@ static void __split_huge_page(struct page *page, struct list_head *list, > __split_huge_page_tail(head, i, lruvec, list); > /* Some pages can be beyond EOF: drop them from page cache */ > if (head[i].index >= end) { > - ClearPageDirty(head + i); > - __delete_from_page_cache(head + i, NULL); > + struct folio *tail = page_folio(head + i); > + > if (shmem_mapping(head->mapping)) > shmem_uncharge(head->mapping->host, 1); > - put_page(head + i); > + else if (folio_test_clear_dirty(tail)) > + folio_account_cleaned(tail, > + inode_to_wb(folio->mapping->host)); > + __filemap_remove_folio(tail, NULL); > + folio_put(tail); > } else if (!PageAnon(page)) { > __xa_store(&head->mapping->i_pages, head[i].index, > head + i, 0);
On Tue, Jun 28, 2022 at 02:18:24PM +0100, Matthew Wilcox wrote: > On Tue, Jun 28, 2022 at 12:31:55PM +0100, Matthew Wilcox wrote: > > On Tue, Jun 28, 2022 at 12:27:40PM +0100, Matthew Wilcox wrote: > > > On Tue, Jun 28, 2022 at 05:31:20PM +1000, Dave Chinner wrote: > > > > So using this technique, I've discovered that there's a dirty page > > > > accounting leak that eventually results in fsx hanging in > > > > balance_dirty_pages(). > > > > > > Alas, I think this is only an accounting error, and not related to > > > the problem(s) that Darrick & Zorro are seeing. I think what you're > > > seeing is dirty pages being dropped at truncation without the > > > appropriate accounting. ie this should be the fix: > > > > Argh, try one that actually compiles. > > ... that one's going to underflow the accounting. Maybe I shouldn't > be writing code at 6am? > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c > index f7248002dad9..4eec6ee83e44 100644 > --- a/mm/huge_memory.c > +++ b/mm/huge_memory.c > @@ -18,6 +18,7 @@ > #include <linux/shrinker.h> > #include <linux/mm_inline.h> > #include <linux/swapops.h> > +#include <linux/backing-dev.h> > #include <linux/dax.h> > #include <linux/khugepaged.h> > #include <linux/freezer.h> > @@ -2439,11 +2440,15 @@ static void __split_huge_page(struct page *page, struct list_head *list, > __split_huge_page_tail(head, i, lruvec, list); > /* Some pages can be beyond EOF: drop them from page cache */ > if (head[i].index >= end) { > - ClearPageDirty(head + i); > - __delete_from_page_cache(head + i, NULL); > + struct folio *tail = page_folio(head + i); > + > if (shmem_mapping(head->mapping)) > shmem_uncharge(head->mapping->host, 1); > - put_page(head + i); > + else if (folio_test_clear_dirty(tail)) > + folio_account_cleaned(tail, > + inode_to_wb(folio->mapping->host)); > + __filemap_remove_folio(tail, NULL); > + folio_put(tail); > } else if (!PageAnon(page)) { > __xa_store(&head->mapping->i_pages, head[i].index, > head + i, 0); > Yup, that fixes the leak. Tested-by: Dave Chinner <dchinner@redhat.com> Cheers, Dave.
On Wed, Jun 29, 2022 at 08:17:57AM +1000, Dave Chinner wrote: > On Tue, Jun 28, 2022 at 02:18:24PM +0100, Matthew Wilcox wrote: > > On Tue, Jun 28, 2022 at 12:31:55PM +0100, Matthew Wilcox wrote: > > > On Tue, Jun 28, 2022 at 12:27:40PM +0100, Matthew Wilcox wrote: > > > > On Tue, Jun 28, 2022 at 05:31:20PM +1000, Dave Chinner wrote: > > > > > So using this technique, I've discovered that there's a dirty page > > > > > accounting leak that eventually results in fsx hanging in > > > > > balance_dirty_pages(). > > > > > > > > Alas, I think this is only an accounting error, and not related to > > > > the problem(s) that Darrick & Zorro are seeing. I think what you're > > > > seeing is dirty pages being dropped at truncation without the > > > > appropriate accounting. ie this should be the fix: > > > > > > Argh, try one that actually compiles. > > > > ... that one's going to underflow the accounting. Maybe I shouldn't > > be writing code at 6am? > > > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c > > index f7248002dad9..4eec6ee83e44 100644 > > --- a/mm/huge_memory.c > > +++ b/mm/huge_memory.c > > @@ -18,6 +18,7 @@ > > #include <linux/shrinker.h> > > #include <linux/mm_inline.h> > > #include <linux/swapops.h> > > +#include <linux/backing-dev.h> > > #include <linux/dax.h> > > #include <linux/khugepaged.h> > > #include <linux/freezer.h> > > @@ -2439,11 +2440,15 @@ static void __split_huge_page(struct page *page, struct list_head *list, > > __split_huge_page_tail(head, i, lruvec, list); > > /* Some pages can be beyond EOF: drop them from page cache */ > > if (head[i].index >= end) { > > - ClearPageDirty(head + i); > > - __delete_from_page_cache(head + i, NULL); > > + struct folio *tail = page_folio(head + i); > > + > > if (shmem_mapping(head->mapping)) > > shmem_uncharge(head->mapping->host, 1); > > - put_page(head + i); > > + else if (folio_test_clear_dirty(tail)) > > + folio_account_cleaned(tail, > > + inode_to_wb(folio->mapping->host)); > > + __filemap_remove_folio(tail, NULL); > > + folio_put(tail); > > } else if (!PageAnon(page)) { > > __xa_store(&head->mapping->i_pages, head[i].index, > > head + i, 0); > > > > Yup, that fixes the leak. > > Tested-by: Dave Chinner <dchinner@redhat.com> Four hours of generic/522 running is long enough to conclude that this is likely the fix for my problem and migrate long soak testing to my main g/522 rig and: Tested-by: Darrick J. Wong <djwong@kernel.org> --D > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com
On Tue, Jun 28, 2022 at 04:21:55PM -0700, Darrick J. Wong wrote: > On Wed, Jun 29, 2022 at 08:17:57AM +1000, Dave Chinner wrote: > > On Tue, Jun 28, 2022 at 02:18:24PM +0100, Matthew Wilcox wrote: > > > On Tue, Jun 28, 2022 at 12:31:55PM +0100, Matthew Wilcox wrote: > > > > On Tue, Jun 28, 2022 at 12:27:40PM +0100, Matthew Wilcox wrote: > > > > > On Tue, Jun 28, 2022 at 05:31:20PM +1000, Dave Chinner wrote: > > > > > > So using this technique, I've discovered that there's a dirty page > > > > > > accounting leak that eventually results in fsx hanging in > > > > > > balance_dirty_pages(). > > > > > > > > > > Alas, I think this is only an accounting error, and not related to > > > > > the problem(s) that Darrick & Zorro are seeing. I think what you're > > > > > seeing is dirty pages being dropped at truncation without the > > > > > appropriate accounting. ie this should be the fix: > > > > > > > > Argh, try one that actually compiles. > > > > > > ... that one's going to underflow the accounting. Maybe I shouldn't > > > be writing code at 6am? > > > > > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c > > > index f7248002dad9..4eec6ee83e44 100644 > > > --- a/mm/huge_memory.c > > > +++ b/mm/huge_memory.c > > > @@ -18,6 +18,7 @@ > > > #include <linux/shrinker.h> > > > #include <linux/mm_inline.h> > > > #include <linux/swapops.h> > > > +#include <linux/backing-dev.h> > > > #include <linux/dax.h> > > > #include <linux/khugepaged.h> > > > #include <linux/freezer.h> > > > @@ -2439,11 +2440,15 @@ static void __split_huge_page(struct page *page, struct list_head *list, > > > __split_huge_page_tail(head, i, lruvec, list); > > > /* Some pages can be beyond EOF: drop them from page cache */ > > > if (head[i].index >= end) { > > > - ClearPageDirty(head + i); > > > - __delete_from_page_cache(head + i, NULL); > > > + struct folio *tail = page_folio(head + i); > > > + > > > if (shmem_mapping(head->mapping)) > > > shmem_uncharge(head->mapping->host, 1); > > > - put_page(head + i); > > > + else if (folio_test_clear_dirty(tail)) > > > + folio_account_cleaned(tail, > > > + inode_to_wb(folio->mapping->host)); > > > + __filemap_remove_folio(tail, NULL); > > > + folio_put(tail); > > > } else if (!PageAnon(page)) { > > > __xa_store(&head->mapping->i_pages, head[i].index, > > > head + i, 0); > > > > > > > Yup, that fixes the leak. > > > > Tested-by: Dave Chinner <dchinner@redhat.com> > > Four hours of generic/522 running is long enough to conclude that this > is likely the fix for my problem and migrate long soak testing to my > main g/522 rig and: > > Tested-by: Darrick J. Wong <djwong@kernel.org> > Just based on Willy's earlier comment.. what I would probably be a little careful/curious about here is whether the accounting fix leads to an indirect behavior change that does impact reproducibility of the corruption problem. For example, does artificially escalated dirty page tracking lead to increased reclaim/writeback activity than might otherwise occur, and thus contend with the fs workload? Clearly it has some impact based on Dave's balance_dirty_pages() problem reproducer, but I don't know if it extends beyond that off the top of my head. That might make some sense if the workload is fsx, since that doesn't typically stress cache/memory usage the way a large fsstress workload or something might. So for example, interesting questions might be... Do your corruption events happen to correspond with dirty page accounting crossing some threshold based on available memory in your test environment? Does reducing available memory affect reproducibility? Etc. Brian > --D > > > Cheers, > > > > Dave. > > -- > > Dave Chinner > > david@fromorbit.com >
On Wed, Jun 29, 2022 at 08:57:30AM -0400, Brian Foster wrote: > On Tue, Jun 28, 2022 at 04:21:55PM -0700, Darrick J. Wong wrote: > > On Wed, Jun 29, 2022 at 08:17:57AM +1000, Dave Chinner wrote: > > > On Tue, Jun 28, 2022 at 02:18:24PM +0100, Matthew Wilcox wrote: > > > > On Tue, Jun 28, 2022 at 12:31:55PM +0100, Matthew Wilcox wrote: > > > > > On Tue, Jun 28, 2022 at 12:27:40PM +0100, Matthew Wilcox wrote: > > > > > > On Tue, Jun 28, 2022 at 05:31:20PM +1000, Dave Chinner wrote: > > > > > > > So using this technique, I've discovered that there's a dirty page > > > > > > > accounting leak that eventually results in fsx hanging in > > > > > > > balance_dirty_pages(). > > > > > > > > > > > > Alas, I think this is only an accounting error, and not related to > > > > > > the problem(s) that Darrick & Zorro are seeing. I think what you're > > > > > > seeing is dirty pages being dropped at truncation without the > > > > > > appropriate accounting. ie this should be the fix: > > > > > > > > > > Argh, try one that actually compiles. > > > > > > > > ... that one's going to underflow the accounting. Maybe I shouldn't > > > > be writing code at 6am? > > > > > > > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c > > > > index f7248002dad9..4eec6ee83e44 100644 > > > > --- a/mm/huge_memory.c > > > > +++ b/mm/huge_memory.c > > > > @@ -18,6 +18,7 @@ > > > > #include <linux/shrinker.h> > > > > #include <linux/mm_inline.h> > > > > #include <linux/swapops.h> > > > > +#include <linux/backing-dev.h> > > > > #include <linux/dax.h> > > > > #include <linux/khugepaged.h> > > > > #include <linux/freezer.h> > > > > @@ -2439,11 +2440,15 @@ static void __split_huge_page(struct page *page, struct list_head *list, > > > > __split_huge_page_tail(head, i, lruvec, list); > > > > /* Some pages can be beyond EOF: drop them from page cache */ > > > > if (head[i].index >= end) { > > > > - ClearPageDirty(head + i); > > > > - __delete_from_page_cache(head + i, NULL); > > > > + struct folio *tail = page_folio(head + i); > > > > + > > > > if (shmem_mapping(head->mapping)) > > > > shmem_uncharge(head->mapping->host, 1); > > > > - put_page(head + i); > > > > + else if (folio_test_clear_dirty(tail)) > > > > + folio_account_cleaned(tail, > > > > + inode_to_wb(folio->mapping->host)); > > > > + __filemap_remove_folio(tail, NULL); > > > > + folio_put(tail); > > > > } else if (!PageAnon(page)) { > > > > __xa_store(&head->mapping->i_pages, head[i].index, > > > > head + i, 0); > > > > > > > > > > Yup, that fixes the leak. > > > > > > Tested-by: Dave Chinner <dchinner@redhat.com> > > > > Four hours of generic/522 running is long enough to conclude that this > > is likely the fix for my problem and migrate long soak testing to my > > main g/522 rig and: > > > > Tested-by: Darrick J. Wong <djwong@kernel.org> > > > > Just based on Willy's earlier comment.. what I would probably be a > little careful/curious about here is whether the accounting fix leads to > an indirect behavior change that does impact reproducibility of the > corruption problem. For example, does artificially escalated dirty page > tracking lead to increased reclaim/writeback activity than might > otherwise occur, and thus contend with the fs workload? Clearly it has > some impact based on Dave's balance_dirty_pages() problem reproducer, > but I don't know if it extends beyond that off the top of my head. That > might make some sense if the workload is fsx, since that doesn't > typically stress cache/memory usage the way a large fsstress workload or > something might. > > So for example, interesting questions might be... Do your corruption > events happen to correspond with dirty page accounting crossing some > threshold based on available memory in your test environment? Does > reducing available memory affect reproducibility? Etc. Yeah, I wonder that too now. I managed to trace generic/522 a couple of times before willy's patch dropped. From what I could tell, a large folio X would get page P assigned to the fsx file's page cache to cover range R, dirtied, and written to disk. At some point later, we'd reflink into part of the file range adjacent to P, but not P itself. I /think/ that should have caused the whole folio to get invalidated? Then some more things happened (none of which dirtied R, according to fsx) and then suddenly writeback would trigger on some page (don't know which) that would write to the disk blocks backing R. I'm fairly sure that's where the incorrect disk contents came from. Next, we'd reflink part of the file range including R into a different part of the file (call it R2). fsx would read R2, bringing a new page into cache, and it wouldn't match the fsxgood buffer, leading to fsx aborting. After a umount/mount cycle, reading R and R2 would both reveal the incorrect contents that had caused fsx to abort. Unfortunately the second ftrace attempt ate some trace data, so I was unable to figure out if the same thing happened again. At this point I really need to get on reviewing patches for 5.20, so I'll try to keep poking at this (examining the trace data requires a lot of concentration which isn't really possible while sawzall construction is going on at home) but at worst I can ask Linus to merge a patch for 5.19 final that makes setting mapping_set_large_folio a Kconfig/CONFIG_XFS_DEBUG option. --D > > Brian > > > --D > > > > > Cheers, > > > > > > Dave. > > > -- > > > Dave Chinner > > > david@fromorbit.com > > >
On Wed, Jun 29, 2022 at 01:22:06PM -0700, Darrick J. Wong wrote: > On Wed, Jun 29, 2022 at 08:57:30AM -0400, Brian Foster wrote: > > On Tue, Jun 28, 2022 at 04:21:55PM -0700, Darrick J. Wong wrote: > > > On Wed, Jun 29, 2022 at 08:17:57AM +1000, Dave Chinner wrote: > > > > On Tue, Jun 28, 2022 at 02:18:24PM +0100, Matthew Wilcox wrote: > > > > > On Tue, Jun 28, 2022 at 12:31:55PM +0100, Matthew Wilcox wrote: > > > > > > On Tue, Jun 28, 2022 at 12:27:40PM +0100, Matthew Wilcox wrote: > > > > > > > On Tue, Jun 28, 2022 at 05:31:20PM +1000, Dave Chinner wrote: > > > > > > > > So using this technique, I've discovered that there's a dirty page > > > > > > > > accounting leak that eventually results in fsx hanging in > > > > > > > > balance_dirty_pages(). > > > > > > > > > > > > > > Alas, I think this is only an accounting error, and not related to > > > > > > > the problem(s) that Darrick & Zorro are seeing. I think what you're > > > > > > > seeing is dirty pages being dropped at truncation without the > > > > > > > appropriate accounting. ie this should be the fix: > > > > > > > > > > > > Argh, try one that actually compiles. > > > > > > > > > > ... that one's going to underflow the accounting. Maybe I shouldn't > > > > > be writing code at 6am? > > > > > > > > > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c > > > > > index f7248002dad9..4eec6ee83e44 100644 > > > > > --- a/mm/huge_memory.c > > > > > +++ b/mm/huge_memory.c > > > > > @@ -18,6 +18,7 @@ > > > > > #include <linux/shrinker.h> > > > > > #include <linux/mm_inline.h> > > > > > #include <linux/swapops.h> > > > > > +#include <linux/backing-dev.h> > > > > > #include <linux/dax.h> > > > > > #include <linux/khugepaged.h> > > > > > #include <linux/freezer.h> > > > > > @@ -2439,11 +2440,15 @@ static void __split_huge_page(struct page *page, struct list_head *list, > > > > > __split_huge_page_tail(head, i, lruvec, list); > > > > > /* Some pages can be beyond EOF: drop them from page cache */ > > > > > if (head[i].index >= end) { > > > > > - ClearPageDirty(head + i); > > > > > - __delete_from_page_cache(head + i, NULL); > > > > > + struct folio *tail = page_folio(head + i); > > > > > + > > > > > if (shmem_mapping(head->mapping)) > > > > > shmem_uncharge(head->mapping->host, 1); > > > > > - put_page(head + i); > > > > > + else if (folio_test_clear_dirty(tail)) > > > > > + folio_account_cleaned(tail, > > > > > + inode_to_wb(folio->mapping->host)); > > > > > + __filemap_remove_folio(tail, NULL); > > > > > + folio_put(tail); > > > > > } else if (!PageAnon(page)) { > > > > > __xa_store(&head->mapping->i_pages, head[i].index, > > > > > head + i, 0); > > > > > > > > > > > > > Yup, that fixes the leak. > > > > > > > > Tested-by: Dave Chinner <dchinner@redhat.com> > > > > > > Four hours of generic/522 running is long enough to conclude that this > > > is likely the fix for my problem and migrate long soak testing to my > > > main g/522 rig and: > > > > > > Tested-by: Darrick J. Wong <djwong@kernel.org> > > > > > > > Just based on Willy's earlier comment.. what I would probably be a > > little careful/curious about here is whether the accounting fix leads to > > an indirect behavior change that does impact reproducibility of the > > corruption problem. For example, does artificially escalated dirty page > > tracking lead to increased reclaim/writeback activity than might > > otherwise occur, and thus contend with the fs workload? Clearly it has > > some impact based on Dave's balance_dirty_pages() problem reproducer, > > but I don't know if it extends beyond that off the top of my head. That > > might make some sense if the workload is fsx, since that doesn't > > typically stress cache/memory usage the way a large fsstress workload or > > something might. > > > > So for example, interesting questions might be... Do your corruption > > events happen to correspond with dirty page accounting crossing some > > threshold based on available memory in your test environment? Does > > reducing available memory affect reproducibility? Etc. > > Yeah, I wonder that too now. I managed to trace generic/522 a couple of > times before willy's patch dropped. From what I could tell, a large > folio X would get page P assigned to the fsx file's page cache to cover > range R, dirtied, and written to disk. At some point later, we'd > reflink into part of the file range adjacent to P, but not P itself. > I /think/ that should have caused the whole folio to get invalidated? > > Then some more things happened (none of which dirtied R, according to > fsx) and then suddenly writeback would trigger on some page (don't know > which) that would write to the disk blocks backing R. I'm fairly sure > that's where the incorrect disk contents came from. > > Next, we'd reflink part of the file range including R into a different > part of the file (call it R2). fsx would read R2, bringing a new page > into cache, and it wouldn't match the fsxgood buffer, leading to fsx > aborting. > > After a umount/mount cycle, reading R and R2 would both reveal the > incorrect contents that had caused fsx to abort. > FWIW, I hadn't been able to reproduce this in my default environment to this point. With the memory leak issue in the light, I was eventually able to by reducing dirty_bytes to something the system would be more likely to hit sooner (i.e. 16-32MB), but I also see stalling behavior and whatnot due to the leak that requires backing off from the specified dirty limit every so often. If I apply the accounting patch to avoid the leak and set dirty_background_bytes to something notably aggressive (1kB), the test survived 100 iterations or so before I stopped it. If I then set dirty_bytes to something similarly aggressive (1MB), I hit the failure on the next iteration (assuming it's the same problem). It's spinning again at ~25 or so iterations without a failure so far, so I'd have to wait and see how reliable the reproducer really is. Though if it doesn't reoccur soonish, perhaps I'll try reducing dirty_bytes a bit more... My suspicion based on these characteristics would be that the blocking limit triggers more aggressive reclaim/invalidation, and thus helps detect the problem sooner. If reflink is involved purely as a cache invalidation step (i.e. so a subsequent read will hit the disk and detect a cache inconsistency), then it might be interesting to see if it can still be reproduced without reflink operations enabled but instead with some combination of the -f/-X fsx flags to perform more flush invals and on-disk data checks.. Brian > Unfortunately the second ftrace attempt ate some trace data, so I was > unable to figure out if the same thing happened again. > > At this point I really need to get on reviewing patches for 5.20, so > I'll try to keep poking at this (examining the trace data requires a lot > of concentration which isn't really possible while sawzall construction > is going on at home) but at worst I can ask Linus to merge a patch for > 5.19 final that makes setting mapping_set_large_folio a > Kconfig/CONFIG_XFS_DEBUG option. > > --D > > > > > Brian > > > > > --D > > > > > > > Cheers, > > > > > > > > Dave. > > > > -- > > > > Dave Chinner > > > > david@fromorbit.com > > > > > >
On Fri, Jul 01, 2022 at 12:03:23PM -0400, Brian Foster wrote: > On Wed, Jun 29, 2022 at 01:22:06PM -0700, Darrick J. Wong wrote: > > On Wed, Jun 29, 2022 at 08:57:30AM -0400, Brian Foster wrote: > > > On Tue, Jun 28, 2022 at 04:21:55PM -0700, Darrick J. Wong wrote: > > > > On Wed, Jun 29, 2022 at 08:17:57AM +1000, Dave Chinner wrote: > > > > > On Tue, Jun 28, 2022 at 02:18:24PM +0100, Matthew Wilcox wrote: > > > > > > On Tue, Jun 28, 2022 at 12:31:55PM +0100, Matthew Wilcox wrote: > > > > > > > On Tue, Jun 28, 2022 at 12:27:40PM +0100, Matthew Wilcox wrote: > > > > > > > > On Tue, Jun 28, 2022 at 05:31:20PM +1000, Dave Chinner wrote: > > > > > > > > > So using this technique, I've discovered that there's a dirty page > > > > > > > > > accounting leak that eventually results in fsx hanging in > > > > > > > > > balance_dirty_pages(). > > > > > > > > > > > > > > > > Alas, I think this is only an accounting error, and not related to > > > > > > > > the problem(s) that Darrick & Zorro are seeing. I think what you're > > > > > > > > seeing is dirty pages being dropped at truncation without the > > > > > > > > appropriate accounting. ie this should be the fix: > > > > > > > > > > > > > > Argh, try one that actually compiles. > > > > > > > > > > > > ... that one's going to underflow the accounting. Maybe I shouldn't > > > > > > be writing code at 6am? > > > > > > > > > > > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c > > > > > > index f7248002dad9..4eec6ee83e44 100644 > > > > > > --- a/mm/huge_memory.c > > > > > > +++ b/mm/huge_memory.c > > > > > > @@ -18,6 +18,7 @@ > > > > > > #include <linux/shrinker.h> > > > > > > #include <linux/mm_inline.h> > > > > > > #include <linux/swapops.h> > > > > > > +#include <linux/backing-dev.h> > > > > > > #include <linux/dax.h> > > > > > > #include <linux/khugepaged.h> > > > > > > #include <linux/freezer.h> > > > > > > @@ -2439,11 +2440,15 @@ static void __split_huge_page(struct page *page, struct list_head *list, > > > > > > __split_huge_page_tail(head, i, lruvec, list); > > > > > > /* Some pages can be beyond EOF: drop them from page cache */ > > > > > > if (head[i].index >= end) { > > > > > > - ClearPageDirty(head + i); > > > > > > - __delete_from_page_cache(head + i, NULL); > > > > > > + struct folio *tail = page_folio(head + i); > > > > > > + > > > > > > if (shmem_mapping(head->mapping)) > > > > > > shmem_uncharge(head->mapping->host, 1); > > > > > > - put_page(head + i); > > > > > > + else if (folio_test_clear_dirty(tail)) > > > > > > + folio_account_cleaned(tail, > > > > > > + inode_to_wb(folio->mapping->host)); > > > > > > + __filemap_remove_folio(tail, NULL); > > > > > > + folio_put(tail); > > > > > > } else if (!PageAnon(page)) { > > > > > > __xa_store(&head->mapping->i_pages, head[i].index, > > > > > > head + i, 0); > > > > > > > > > > > > > > > > Yup, that fixes the leak. > > > > > > > > > > Tested-by: Dave Chinner <dchinner@redhat.com> > > > > > > > > Four hours of generic/522 running is long enough to conclude that this > > > > is likely the fix for my problem and migrate long soak testing to my > > > > main g/522 rig and: > > > > > > > > Tested-by: Darrick J. Wong <djwong@kernel.org> > > > > > > > > > > Just based on Willy's earlier comment.. what I would probably be a > > > little careful/curious about here is whether the accounting fix leads to > > > an indirect behavior change that does impact reproducibility of the > > > corruption problem. For example, does artificially escalated dirty page > > > tracking lead to increased reclaim/writeback activity than might > > > otherwise occur, and thus contend with the fs workload? Clearly it has > > > some impact based on Dave's balance_dirty_pages() problem reproducer, > > > but I don't know if it extends beyond that off the top of my head. That > > > might make some sense if the workload is fsx, since that doesn't > > > typically stress cache/memory usage the way a large fsstress workload or > > > something might. > > > > > > So for example, interesting questions might be... Do your corruption > > > events happen to correspond with dirty page accounting crossing some > > > threshold based on available memory in your test environment? Does > > > reducing available memory affect reproducibility? Etc. > > > > Yeah, I wonder that too now. I managed to trace generic/522 a couple of > > times before willy's patch dropped. From what I could tell, a large > > folio X would get page P assigned to the fsx file's page cache to cover > > range R, dirtied, and written to disk. At some point later, we'd > > reflink into part of the file range adjacent to P, but not P itself. > > I /think/ that should have caused the whole folio to get invalidated? > > > > Then some more things happened (none of which dirtied R, according to > > fsx) and then suddenly writeback would trigger on some page (don't know > > which) that would write to the disk blocks backing R. I'm fairly sure > > that's where the incorrect disk contents came from. > > > > Next, we'd reflink part of the file range including R into a different > > part of the file (call it R2). fsx would read R2, bringing a new page > > into cache, and it wouldn't match the fsxgood buffer, leading to fsx > > aborting. > > > > After a umount/mount cycle, reading R and R2 would both reveal the > > incorrect contents that had caused fsx to abort. > > > > FWIW, I hadn't been able to reproduce this in my default environment to > this point. With the memory leak issue in the light, I was eventually > able to by reducing dirty_bytes to something the system would be more > likely to hit sooner (i.e. 16-32MB), but I also see stalling behavior > and whatnot due to the leak that requires backing off from the specified > dirty limit every so often. > > If I apply the accounting patch to avoid the leak and set > dirty_background_bytes to something notably aggressive (1kB), the test > survived 100 iterations or so before I stopped it. If I then set > dirty_bytes to something similarly aggressive (1MB), I hit the failure > on the next iteration (assuming it's the same problem). It's spinning > again at ~25 or so iterations without a failure so far, so I'd have to > wait and see how reliable the reproducer really is. Though if it doesn't > reoccur soonish, perhaps I'll try reducing dirty_bytes a bit more... > > My suspicion based on these characteristics would be that the blocking > limit triggers more aggressive reclaim/invalidation, and thus helps > detect the problem sooner. If reflink is involved purely as a cache > invalidation step (i.e. so a subsequent read will hit the disk and > detect a cache inconsistency), then it might be interesting to see if it > can still be reproduced without reflink operations enabled but instead > with some combination of the -f/-X fsx flags to perform more flush > invals and on-disk data checks.. Hm. I didn't try -f or lowering dirty_bytes, but with the reflink operations disabled, g522 ran for 3 hours before I gave up and killed it. I would've thought that the fallocate zero/collapse/insert range functions (which use the same flush/unmap helper) would have sufficed to make the problem happen, but ... it didn't. I think I'll try changing dirty_bytes next, to see if I can reproduce the problem that way. I'm not surprised that you had to set dirty_bytes to 1MB, since 522 is only ever creating a 600K file anyway. (Hopefully willy will be back next week to help us shed some light on this.) --D > Brian > > > Unfortunately the second ftrace attempt ate some trace data, so I was > > unable to figure out if the same thing happened again. > > > > At this point I really need to get on reviewing patches for 5.20, so > > I'll try to keep poking at this (examining the trace data requires a lot > > of concentration which isn't really possible while sawzall construction > > is going on at home) but at worst I can ask Linus to merge a patch for > > 5.19 final that makes setting mapping_set_large_folio a > > Kconfig/CONFIG_XFS_DEBUG option. > > > > --D > > > > > > > > Brian > > > > > > > --D > > > > > > > > > Cheers, > > > > > > > > > > Dave. > > > > > -- > > > > > Dave Chinner > > > > > david@fromorbit.com > > > > > > > > > >
On Wed, Jun 29, 2022 at 08:57:30AM -0400, Brian Foster wrote: > On Tue, Jun 28, 2022 at 04:21:55PM -0700, Darrick J. Wong wrote: > > On Wed, Jun 29, 2022 at 08:17:57AM +1000, Dave Chinner wrote: > > > On Tue, Jun 28, 2022 at 02:18:24PM +0100, Matthew Wilcox wrote: > > > > On Tue, Jun 28, 2022 at 12:31:55PM +0100, Matthew Wilcox wrote: > > > > > On Tue, Jun 28, 2022 at 12:27:40PM +0100, Matthew Wilcox wrote: > > > > > > On Tue, Jun 28, 2022 at 05:31:20PM +1000, Dave Chinner wrote: > > > > > > > So using this technique, I've discovered that there's a dirty page > > > > > > > accounting leak that eventually results in fsx hanging in > > > > > > > balance_dirty_pages(). > > > > > > > > > > > > Alas, I think this is only an accounting error, and not related to > > > > > > the problem(s) that Darrick & Zorro are seeing. I think what you're > > > > > > seeing is dirty pages being dropped at truncation without the > > > > > > appropriate accounting. ie this should be the fix: > > > > > > > > > > Argh, try one that actually compiles. > > > > > > > > ... that one's going to underflow the accounting. Maybe I shouldn't > > > > be writing code at 6am? > > > > > > > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c > > > > index f7248002dad9..4eec6ee83e44 100644 > > > > --- a/mm/huge_memory.c > > > > +++ b/mm/huge_memory.c > > > > @@ -18,6 +18,7 @@ > > > > #include <linux/shrinker.h> > > > > #include <linux/mm_inline.h> > > > > #include <linux/swapops.h> > > > > +#include <linux/backing-dev.h> > > > > #include <linux/dax.h> > > > > #include <linux/khugepaged.h> > > > > #include <linux/freezer.h> > > > > @@ -2439,11 +2440,15 @@ static void __split_huge_page(struct page *page, struct list_head *list, > > > > __split_huge_page_tail(head, i, lruvec, list); > > > > /* Some pages can be beyond EOF: drop them from page cache */ > > > > if (head[i].index >= end) { > > > > - ClearPageDirty(head + i); > > > > - __delete_from_page_cache(head + i, NULL); > > > > + struct folio *tail = page_folio(head + i); > > > > + > > > > if (shmem_mapping(head->mapping)) > > > > shmem_uncharge(head->mapping->host, 1); > > > > - put_page(head + i); > > > > + else if (folio_test_clear_dirty(tail)) > > > > + folio_account_cleaned(tail, > > > > + inode_to_wb(folio->mapping->host)); > > > > + __filemap_remove_folio(tail, NULL); > > > > + folio_put(tail); > > > > } else if (!PageAnon(page)) { > > > > __xa_store(&head->mapping->i_pages, head[i].index, > > > > head + i, 0); > > > > > > > > > > Yup, that fixes the leak. > > > > > > Tested-by: Dave Chinner <dchinner@redhat.com> > > > > Four hours of generic/522 running is long enough to conclude that this > > is likely the fix for my problem and migrate long soak testing to my > > main g/522 rig and: > > > > Tested-by: Darrick J. Wong <djwong@kernel.org> > > > > Just based on Willy's earlier comment.. what I would probably be a > little careful/curious about here is whether the accounting fix leads to > an indirect behavior change that does impact reproducibility of the > corruption problem. For example, does artificially escalated dirty page > tracking lead to increased reclaim/writeback activity than might > otherwise occur, and thus contend with the fs workload? Yeah, I think you're right, Brian. There is an underlying problem here, and the writeback/memory pressure bugs exposed it for everyone to see. IOWs, we do indeed have a pre-existing problem with partial writes, iomap, unwritten extents, writeback and memory reclaim. (Please s/page/folio/ rather than complain I'm calling things pages and not folios, I'm juggling between RHEL-8 code and upstream here and it's just easier to refer to everything as pages as I write it.) (I'm assuming people are familiar with the iomap buffered IO code at this point.) Let's do a partial write into a page. Let's say the second half of the page. __iomap_begin_write() drops into iomap_adjust_read_range() to determine the range that needs zeroing or reading from disk. We then call iomap_block_needs_zeroing() to determine if we zero - which we do if it's a newly allocated or unwritten extent. Otherwise we read it from disk. Let's put this partial write over a large unwritten extent, which means we zero the first part of the page, and that writing it back runs unwritten extent conversion at IO completion. This changes the extent backing this file offset from unwritten to written. Then, add extreme memory pressure, so writeback is already running and soon after the page has been dirtied it gets marked for cleaning. Therefore, we have the possibility of background IO submission and memory reclaim occurring at the same time as new writes are occurring to the file. To that end, let's add a large racing write that *ends* on this same page that we have already partially filled with data - it will fill the start of the page that currently contains zeros with real data. Let's start that write() just before writeback of the already uptodate partial page starts. iomap_apply will map the entire write (and well beyond) as a single unwritten extent. This extent is the same extent that the original page in the page cache already covers, and this write would end by filling the remaining part of that write. While that second write is running around in the page cache copy-in loop for the unwritten extent that backs that range, IO submission runs and completes and converts the original partially filled page to a written extent, then marks it clean. [ Ayup, the extent map of the file just changed to be different to the iomap that the current write() thinks the file layout has. ] Further, because the partially written page is now clean, memory reclaim can snaffle it up and it gets removed from the page cache. The second write finally gets to the last page of it's write and doesn't find it in the page cache because memory reclaim removed it. So it pulls a new page into the page cache in iomap_begin_write(), and then in __iomap_begin_write we see that the we need to fill the second half of the page with either zeros or data from disk. We know that this data is on disk as the extent is now in written state. *However*, the cached iomap that we are currently using for the write() says the range is -unwritten-, and so at that point iomap_block_needs_zeroing() says "page needs zeroing". Hence __iomap_begin_write zeroes the second half of the page instead of reading it from disk, obliterating the data that the previous write had already written there. Ouchy, we have data corruption because the incoming write holds a stale cached iomap...... At this point, I had a nasty feeling of deja vu. Oh, yeah, we already infrastructure in place to avoid using stale cached iomaps in ->writepages.... That is, iomap_writepage_map() calls ->map_block for every filesytem block, even though the iomap_writepage_ctx has a cached iomap in it. That's so the filesystem can check that the cached iomap is still valid whilst the page we are writing back is held locked. This protects writeback against races with truncate, hole punch, etc. XFS does this via a call to xfs_imap_valid() - we keep a generation number in the XFS extent trees that is bumped on every change so that we can check whether a cached iomap is still valid. If the generation numbers don't match, then we generate a new iomap for the writepage context.... What is clear to me now is that we have the same stale cached iomap problem with iomap_apply() because writeback can change the extent map state in a way that influence incoming writes and those updates are not in any way serialised against the incoming write IO path. Hence we have to validate that the cached iomap is not stale before we decide what to do with a new page in the page cache (zero or read from disk), and we have to do this iomap validation when the newly instantiated page is locked and ready for initialisation. I have not thought past this point - I'm only *just* smart enough to be able to dig out the root cause of this problem. No scratch that, I'm not smart, just stubborn enough to ignore the fact I've got the analysis wrong at least a dozen times already. That said, this stale mapping problem existed long before we started using unwritten extents for delayed allocation. The change to use unwritten extents (to fix other data corruption issues) exposed us to in-memory zeroing in this race condition case rather reading the block of data from disk. We just hadn't armed the landmine.... I suspect we are going to need a new iomap validation operation, and require the iomap actors to be able to handle iomap validation failure. What that looks like, I don't know as my brain turned to goo and dribbled out my ears about 4 hours ago.... BTW, all the credit for finding this goes to Frank Sorenson - he did all the hard work of reproducing the issue and narrowing down the scope of the problem to a handful of XFS commits combined with writeback and memory reclaim interactions. Without his hard work, we'd still have no idea what was going on..... Cheers, Dave.
On Wed, Aug 17, 2022 at 07:36:27PM +1000, Dave Chinner wrote: > On Wed, Jun 29, 2022 at 08:57:30AM -0400, Brian Foster wrote: > > On Tue, Jun 28, 2022 at 04:21:55PM -0700, Darrick J. Wong wrote: > > > On Wed, Jun 29, 2022 at 08:17:57AM +1000, Dave Chinner wrote: > > > > On Tue, Jun 28, 2022 at 02:18:24PM +0100, Matthew Wilcox wrote: > > > > > On Tue, Jun 28, 2022 at 12:31:55PM +0100, Matthew Wilcox wrote: > > > > > > On Tue, Jun 28, 2022 at 12:27:40PM +0100, Matthew Wilcox wrote: > > > > > > > On Tue, Jun 28, 2022 at 05:31:20PM +1000, Dave Chinner wrote: > > > > > > > > So using this technique, I've discovered that there's a dirty page > > > > > > > > accounting leak that eventually results in fsx hanging in > > > > > > > > balance_dirty_pages(). > > > > > > > > > > > > > > Alas, I think this is only an accounting error, and not related to > > > > > > > the problem(s) that Darrick & Zorro are seeing. I think what you're > > > > > > > seeing is dirty pages being dropped at truncation without the > > > > > > > appropriate accounting. ie this should be the fix: > > > > > > > > > > > > Argh, try one that actually compiles. > > > > > > > > > > ... that one's going to underflow the accounting. Maybe I shouldn't > > > > > be writing code at 6am? > > > > > > > > > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c > > > > > index f7248002dad9..4eec6ee83e44 100644 > > > > > --- a/mm/huge_memory.c > > > > > +++ b/mm/huge_memory.c > > > > > @@ -18,6 +18,7 @@ > > > > > #include <linux/shrinker.h> > > > > > #include <linux/mm_inline.h> > > > > > #include <linux/swapops.h> > > > > > +#include <linux/backing-dev.h> > > > > > #include <linux/dax.h> > > > > > #include <linux/khugepaged.h> > > > > > #include <linux/freezer.h> > > > > > @@ -2439,11 +2440,15 @@ static void __split_huge_page(struct page *page, struct list_head *list, > > > > > __split_huge_page_tail(head, i, lruvec, list); > > > > > /* Some pages can be beyond EOF: drop them from page cache */ > > > > > if (head[i].index >= end) { > > > > > - ClearPageDirty(head + i); > > > > > - __delete_from_page_cache(head + i, NULL); > > > > > + struct folio *tail = page_folio(head + i); > > > > > + > > > > > if (shmem_mapping(head->mapping)) > > > > > shmem_uncharge(head->mapping->host, 1); > > > > > - put_page(head + i); > > > > > + else if (folio_test_clear_dirty(tail)) > > > > > + folio_account_cleaned(tail, > > > > > + inode_to_wb(folio->mapping->host)); > > > > > + __filemap_remove_folio(tail, NULL); > > > > > + folio_put(tail); > > > > > } else if (!PageAnon(page)) { > > > > > __xa_store(&head->mapping->i_pages, head[i].index, > > > > > head + i, 0); > > > > > > > > > > > > > Yup, that fixes the leak. > > > > > > > > Tested-by: Dave Chinner <dchinner@redhat.com> > > > > > > Four hours of generic/522 running is long enough to conclude that this > > > is likely the fix for my problem and migrate long soak testing to my > > > main g/522 rig and: > > > > > > Tested-by: Darrick J. Wong <djwong@kernel.org> > > > > > > > Just based on Willy's earlier comment.. what I would probably be a > > little careful/curious about here is whether the accounting fix leads to > > an indirect behavior change that does impact reproducibility of the > > corruption problem. For example, does artificially escalated dirty page > > tracking lead to increased reclaim/writeback activity than might > > otherwise occur, and thus contend with the fs workload? > > Yeah, I think you're right, Brian. There is an underlying problem > here, and the writeback/memory pressure bugs exposed it for everyone > to see. IOWs, we do indeed have a pre-existing problem with partial > writes, iomap, unwritten extents, writeback and memory reclaim. > > (Please s/page/folio/ rather than complain I'm calling things pages > and not folios, I'm juggling between RHEL-8 code and upstream > here and it's just easier to refer to everything as pages as I write > it.) > > (I'm assuming people are familiar with the iomap buffered IO code > at this point.) > > Let's do a partial write into a page. Let's say the second half of > the page. __iomap_begin_write() drops into iomap_adjust_read_range() > to determine the range that needs zeroing or reading from disk. We > then call iomap_block_needs_zeroing() to determine if we zero - > which we do if it's a newly allocated or unwritten extent. Otherwise > we read it from disk. > > Let's put this partial write over a large unwritten extent, which > means we zero the first part of the page, and that writing it back > runs unwritten extent conversion at IO completion. This changes the > extent backing this file offset from unwritten to written. > > Then, add extreme memory pressure, so writeback is already running > and soon after the page has been dirtied it gets marked for > cleaning. Therefore, we have the possibility of background IO > submission and memory reclaim occurring at the same time as new > writes are occurring to the file. > > To that end, let's add a large racing write that *ends* on this same > page that we have already partially filled with data - it will fill > the start of the page that currently contains zeros with real data. > > Let's start that write() just before writeback of the already > uptodate partial page starts. iomap_apply will map the entire > write (and well beyond) as a single unwritten extent. This extent is > the same extent that the original page in the page cache already > covers, and this write would end by filling the remaining part of > that write. > > While that second write is running around in the page cache > copy-in loop for the unwritten extent that backs that range, IO > submission runs and completes and converts the original partially > filled page to a written extent, then marks it clean. So if I understand this correctly -- Thread 1 fallocates 1G of contiguous space and writes 38 bytes to the end of that gigabyte. We instantiate a folio X for the region, zero the parts we're not going to write, and copy the bytes we did want to write into folio X. The folio is now uptodate and dirty. Thread 1 goes away. Thread 2 starts a (1G-38) write of that fallocated (and mostly unwritten) space. It proceeds very slowly, one page at a time; it'll be a while before it gets to folio X. ramsnake.exe fires up and pushes the system hard into memory reclaim. Reclaim initiates dirty writeback of folio X, which writes it to disk and does the unwritten extent conversion. Reclaim then removes X from the page cache, and frees X. Thread 2 finally gets to the end of its 1G write. It is still using the cached 1G unwritten mapping from before, unaware that some of the mapping has changed to written state. It grabs a folio for (1G - 38), but it gets folio Y instead of folio X. Y is not uptodate, so thread 2 copies bytes from userspace and zeroes the last 38 bytes. **WRONG** We now have folio Y with zeroes in the last 38 bytes, instead of whatever thread 1 wrote. Writeback pushes Y to disk, clobbering those 38 bytes forever. > [ Ayup, the extent map of the file just changed to be different > to the iomap that the current write() thinks the file layout has. ] > > Further, because the partially written page is now clean, > memory reclaim can snaffle it up and it gets removed from the page > cache. > > The second write finally gets to the last page of it's write and > doesn't find it in the page cache because memory reclaim removed it. > So it pulls a new page into the page cache in iomap_begin_write(), > and then in __iomap_begin_write we see that the we need to fill the > second half of the page with either zeros or data from disk. > > We know that this data is on disk as the extent is now in written > state. *However*, the cached iomap that we are currently using for > the write() says the range is -unwritten-, and so at that point > iomap_block_needs_zeroing() says "page needs zeroing". Hence > __iomap_begin_write zeroes the second half of the page instead of > reading it from disk, obliterating the data that the previous write > had already written there. Yup. > Ouchy, we have data corruption because the incoming write holds a > stale cached iomap...... > > At this point, I had a nasty feeling of deja vu. > > Oh, yeah, we already infrastructure in place to avoid using stale > cached iomaps in ->writepages.... > > That is, iomap_writepage_map() calls ->map_block for every filesytem > block, even though the iomap_writepage_ctx has a cached iomap in it. > That's so the filesystem can check that the cached iomap is still > valid whilst the page we are writing back is held locked. This > protects writeback against races with truncate, hole punch, etc. > > XFS does this via a call to xfs_imap_valid() - we keep a generation > number in the XFS extent trees that is bumped on every change so > that we can check whether a cached iomap is still valid. If the > generation numbers don't match, then we generate a new iomap for the > writepage context.... > > What is clear to me now is that we have the same stale cached iomap > problem with iomap_apply() because writeback can change the extent > map state in a way that influence incoming writes and those updates > are not in any way serialised against the incoming write IO path. > > Hence we have to validate that the cached iomap is not stale before > we decide what to do with a new page in the page cache (zero or read > from disk), and we have to do this iomap validation when the newly > instantiated page is locked and ready for initialisation. > > I have not thought past this point - I'm only *just* smart enough to > be able to dig out the root cause of this problem. No scratch that, > I'm not smart, just stubborn enough to ignore the fact I've got the > analysis wrong at least a dozen times already. > > That said, this stale mapping problem existed long before we started > using unwritten extents for delayed allocation. The change to use > unwritten extents (to fix other data corruption issues) exposed us > to in-memory zeroing in this race condition case rather reading the > block of data from disk. We just hadn't armed the landmine.... > > I suspect we are going to need a new iomap validation operation, and > require the iomap actors to be able to handle iomap validation > failure. What that looks like, I don't know as my brain > turned to goo and dribbled out my ears about 4 hours ago.... Euughg. I suppose this means that ->iomap_begin has to be able to stuff xfs_ifork.if_seq into a cookie value that we can set in struct iomap somewhere? And then iomap has to be able to call back to XFS to revalidate iomap.cookie? And if iomap.cookie != xfs_ifork.if_seq, then I guess iomap will just ... unlock and put the folio, and return EAGAIN (or something) so that its caller will return however many bytes have been dirtied to iomap_apply so that _apply will go get a fresh mapping? > BTW, all the credit for finding this goes to Frank Sorenson - he did > all the hard work of reproducing the issue and narrowing down the > scope of the problem to a handful of XFS commits combined with > writeback and memory reclaim interactions. Without his hard work, > we'd still have no idea what was going on..... Thanks a bunch to him!!! --D > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com
On Wed, Aug 17, 2022 at 04:53:31PM -0700, Darrick J. Wong wrote: > On Wed, Aug 17, 2022 at 07:36:27PM +1000, Dave Chinner wrote: > > On Wed, Jun 29, 2022 at 08:57:30AM -0400, Brian Foster wrote: > > > Just based on Willy's earlier comment.. what I would probably be a > > > little careful/curious about here is whether the accounting fix leads to > > > an indirect behavior change that does impact reproducibility of the > > > corruption problem. For example, does artificially escalated dirty page > > > tracking lead to increased reclaim/writeback activity than might > > > otherwise occur, and thus contend with the fs workload? > > > > Yeah, I think you're right, Brian. There is an underlying problem > > here, and the writeback/memory pressure bugs exposed it for everyone > > to see. IOWs, we do indeed have a pre-existing problem with partial > > writes, iomap, unwritten extents, writeback and memory reclaim. > > > > (Please s/page/folio/ rather than complain I'm calling things pages > > and not folios, I'm juggling between RHEL-8 code and upstream > > here and it's just easier to refer to everything as pages as I write > > it.) > > > > (I'm assuming people are familiar with the iomap buffered IO code > > at this point.) > > > > Let's do a partial write into a page. Let's say the second half of > > the page. __iomap_begin_write() drops into iomap_adjust_read_range() > > to determine the range that needs zeroing or reading from disk. We > > then call iomap_block_needs_zeroing() to determine if we zero - > > which we do if it's a newly allocated or unwritten extent. Otherwise > > we read it from disk. > > > > Let's put this partial write over a large unwritten extent, which > > means we zero the first part of the page, and that writing it back > > runs unwritten extent conversion at IO completion. This changes the > > extent backing this file offset from unwritten to written. > > > > Then, add extreme memory pressure, so writeback is already running > > and soon after the page has been dirtied it gets marked for > > cleaning. Therefore, we have the possibility of background IO > > submission and memory reclaim occurring at the same time as new > > writes are occurring to the file. > > > > To that end, let's add a large racing write that *ends* on this same > > page that we have already partially filled with data - it will fill > > the start of the page that currently contains zeros with real data. > > > > Let's start that write() just before writeback of the already > > uptodate partial page starts. iomap_apply will map the entire > > write (and well beyond) as a single unwritten extent. This extent is > > the same extent that the original page in the page cache already > > covers, and this write would end by filling the remaining part of > > that write. > > > > While that second write is running around in the page cache > > copy-in loop for the unwritten extent that backs that range, IO > > submission runs and completes and converts the original partially > > filled page to a written extent, then marks it clean. > > So if I understand this correctly -- > > Thread 1 fallocates 1G of contiguous space and writes 38 bytes to the > end of that gigabyte. We instantiate a folio X for the region, zero the > parts we're not going to write, and copy the bytes we did want to write > into folio X. The folio is now uptodate and dirty. Thread 1 goes away. > > Thread 2 starts a (1G-38) write of that fallocated (and mostly > unwritten) space. It proceeds very slowly, one page at a time; it'll be > a while before it gets to folio X. > > ramsnake.exe fires up and pushes the system hard into memory reclaim. > Reclaim initiates dirty writeback of folio X, which writes it to disk > and does the unwritten extent conversion. Reclaim then removes X from > the page cache, and frees X. > > Thread 2 finally gets to the end of its 1G write. It is still using the > cached 1G unwritten mapping from before, unaware that some of the > mapping has changed to written state. It grabs a folio for (1G - 38), > but it gets folio Y instead of folio X. Y is not uptodate, so thread 2 > copies bytes from userspace and zeroes the last 38 bytes. > > **WRONG** > > We now have folio Y with zeroes in the last 38 bytes, instead of > whatever thread 1 wrote. Writeback pushes Y to disk, clobbering those > 38 bytes forever. Yes, that's pretty much it. > > [ Ayup, the extent map of the file just changed to be different > > to the iomap that the current write() thinks the file layout has. ] > > > > Further, because the partially written page is now clean, > > memory reclaim can snaffle it up and it gets removed from the page > > cache. > > > > The second write finally gets to the last page of it's write and > > doesn't find it in the page cache because memory reclaim removed it. > > So it pulls a new page into the page cache in iomap_begin_write(), > > and then in __iomap_begin_write we see that the we need to fill the > > second half of the page with either zeros or data from disk. > > > > We know that this data is on disk as the extent is now in written > > state. *However*, the cached iomap that we are currently using for > > the write() says the range is -unwritten-, and so at that point > > iomap_block_needs_zeroing() says "page needs zeroing". Hence > > __iomap_begin_write zeroes the second half of the page instead of > > reading it from disk, obliterating the data that the previous write > > had already written there. > > Yup. > > > Ouchy, we have data corruption because the incoming write holds a > > stale cached iomap...... > > > > At this point, I had a nasty feeling of deja vu. > > > > Oh, yeah, we already infrastructure in place to avoid using stale > > cached iomaps in ->writepages.... > > > > That is, iomap_writepage_map() calls ->map_block for every filesytem > > block, even though the iomap_writepage_ctx has a cached iomap in it. > > That's so the filesystem can check that the cached iomap is still > > valid whilst the page we are writing back is held locked. This > > protects writeback against races with truncate, hole punch, etc. > > > > XFS does this via a call to xfs_imap_valid() - we keep a generation > > number in the XFS extent trees that is bumped on every change so > > that we can check whether a cached iomap is still valid. If the > > generation numbers don't match, then we generate a new iomap for the > > writepage context.... > > > > What is clear to me now is that we have the same stale cached iomap > > problem with iomap_apply() because writeback can change the extent > > map state in a way that influence incoming writes and those updates > > are not in any way serialised against the incoming write IO path. > > > > Hence we have to validate that the cached iomap is not stale before > > we decide what to do with a new page in the page cache (zero or read > > from disk), and we have to do this iomap validation when the newly > > instantiated page is locked and ready for initialisation. > > > > I have not thought past this point - I'm only *just* smart enough to > > be able to dig out the root cause of this problem. No scratch that, > > I'm not smart, just stubborn enough to ignore the fact I've got the > > analysis wrong at least a dozen times already. > > > > That said, this stale mapping problem existed long before we started > > using unwritten extents for delayed allocation. The change to use > > unwritten extents (to fix other data corruption issues) exposed us > > to in-memory zeroing in this race condition case rather reading the > > block of data from disk. We just hadn't armed the landmine.... > > > > I suspect we are going to need a new iomap validation operation, and > > require the iomap actors to be able to handle iomap validation > > failure. What that looks like, I don't know as my brain > > turned to goo and dribbled out my ears about 4 hours ago.... > > Euughg. I suppose this means that ->iomap_begin has to be able to stuff > xfs_ifork.if_seq into a cookie value that we can set in struct iomap > somewhere? And then iomap has to be able to call back to XFS to > revalidate iomap.cookie? And if iomap.cookie != xfs_ifork.if_seq, then > I guess iomap will just ... unlock and put the folio, and return EAGAIN > (or something) so that its caller will return however many bytes have > been dirtied to iomap_apply so that _apply will go get a fresh mapping? Yeah, something like that. I have not had time to think this through any further than what you've just written. It does not look like it will be trivial, though, and it points out that every filesystem using iomap for multi-page extent maps will need to implement iomap invalidation detection in some way. We already have that for XFS with the sequence numbers, but I'm betting that none of the other filesystems that use iomap have the internal infrastructure to track racing extent map changes... Cheers, Dave.
diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c index da4af2142a2b..cdc39f576ca1 100644 --- a/fs/xfs/xfs_icache.c +++ b/fs/xfs/xfs_icache.c @@ -87,6 +87,7 @@ xfs_inode_alloc( /* VFS doesn't initialise i_mode or i_state! */ VFS_I(ip)->i_mode = 0; VFS_I(ip)->i_state = 0; + mapping_set_large_folios(VFS_I(ip)->i_mapping); XFS_STATS_INC(mp, vn_active); ASSERT(atomic_read(&ip->i_pincount) == 0); @@ -320,6 +321,7 @@ xfs_reinit_inode( inode->i_rdev = dev; inode->i_uid = uid; inode->i_gid = gid; + mapping_set_large_folios(inode->i_mapping); return error; }