mbox series

[v2,0/5] Fix memory leak on failed cache-writes

Message ID 20200213155803.14799-1-johannes.thumshirn@wdc.com (mailing list archive)
Headers show
Series Fix memory leak on failed cache-writes | expand

Message

Johannes Thumshirn Feb. 13, 2020, 3:57 p.m. UTC
Fstests' test case generic/475 reliably leaks the btrfs_io_ctl::pages
allocated in __btrfs_write_out_cache().

The first patch in this series is freeing the pages when we throw away a dirty
block group. The other patches are small things I noticed while hunting down the
problem and are independant of fix.

Changes to v1:
- Move fix to the first position (David)

Johannes Thumshirn (5):
  btrfs: free allocated pages on failed cache write-out
  btrfs: use inode from io_ctl in io_ctl_prepare_pages
  btrfs: make the uptodate argument of io_ctl_add_pages() boolean.
  btrfs: use standard debug config option to enable free-space-cache
    debug prints
  btrfs: simplify error handling in __btrfs_write_out_cache()

 fs/btrfs/disk-io.c          |  6 +++++
 fs/btrfs/free-space-cache.c | 44 ++++++++++++++++++++-----------------
 fs/btrfs/free-space-cache.h |  1 +
 3 files changed, 31 insertions(+), 20 deletions(-)

Comments

David Sterba Feb. 18, 2020, 4:50 p.m. UTC | #1
On Fri, Feb 14, 2020 at 12:57:58AM +0900, Johannes Thumshirn wrote:
> Fstests' test case generic/475 reliably leaks the btrfs_io_ctl::pages
> allocated in __btrfs_write_out_cache().
> 
> The first patch in this series is freeing the pages when we throw away a dirty
> block group. The other patches are small things I noticed while hunting down the
> problem and are independant of fix.
> 
> Changes to v1:
> - Move fix to the first position (David)
> 
> Johannes Thumshirn (5):
>   btrfs: free allocated pages on failed cache write-out
>   btrfs: use inode from io_ctl in io_ctl_prepare_pages
>   btrfs: make the uptodate argument of io_ctl_add_pages() boolean.
>   btrfs: use standard debug config option to enable free-space-cache
>     debug prints
>   btrfs: simplify error handling in __btrfs_write_out_cache()

I've seen some weird test failures and this patchset was in the tested
branch (either for-next or standalone). I need to retest misc-next again
to have a clean baseline so I can see the diff.
Johannes Thumshirn Feb. 18, 2020, 4:54 p.m. UTC | #2
On 18/02/2020 17:50, David Sterba wrote:
> On Fri, Feb 14, 2020 at 12:57:58AM +0900, Johannes Thumshirn wrote:
>> Fstests' test case generic/475 reliably leaks the btrfs_io_ctl::pages
>> allocated in __btrfs_write_out_cache().
>>
>> The first patch in this series is freeing the pages when we throw away a dirty
>> block group. The other patches are small things I noticed while hunting down the
>> problem and are independant of fix.
>>
>> Changes to v1:
>> - Move fix to the first position (David)
>>
>> Johannes Thumshirn (5):
>>    btrfs: free allocated pages on failed cache write-out
>>    btrfs: use inode from io_ctl in io_ctl_prepare_pages
>>    btrfs: make the uptodate argument of io_ctl_add_pages() boolean.
>>    btrfs: use standard debug config option to enable free-space-cache
>>      debug prints
>>    btrfs: simplify error handling in __btrfs_write_out_cache()
> 
> I've seen some weird test failures and this patchset was in the tested
> branch (either for-next or standalone). I need to retest misc-next again
> to have a clean baseline so I can see the diff.
> 

Interesting, can you forward me the failures?
David Sterba Feb. 24, 2020, 5:49 p.m. UTC | #3
On Tue, Feb 18, 2020 at 04:54:18PM +0000, Johannes Thumshirn wrote:
> On 18/02/2020 17:50, David Sterba wrote:
> > On Fri, Feb 14, 2020 at 12:57:58AM +0900, Johannes Thumshirn wrote:
> >> Fstests' test case generic/475 reliably leaks the btrfs_io_ctl::pages
> >> allocated in __btrfs_write_out_cache().
> >>
> >> The first patch in this series is freeing the pages when we throw away a dirty
> >> block group. The other patches are small things I noticed while hunting down the
> >> problem and are independant of fix.
> >>
> >> Changes to v1:
> >> - Move fix to the first position (David)
> >>
> >> Johannes Thumshirn (5):
> >>    btrfs: free allocated pages on failed cache write-out
> >>    btrfs: use inode from io_ctl in io_ctl_prepare_pages
> >>    btrfs: make the uptodate argument of io_ctl_add_pages() boolean.
> >>    btrfs: use standard debug config option to enable free-space-cache
> >>      debug prints
> >>    btrfs: simplify error handling in __btrfs_write_out_cache()
> > 
> > I've seen some weird test failures and this patchset was in the tested
> > branch (either for-next or standalone). I need to retest misc-next again
> > to have a clean baseline so I can see the diff.
> > 
> 
> Interesting, can you forward me the failures?

So it was test btrfs/125 but I can't say for sure which patchset causes
it, there are several reports but unlikely to be caused by yours.

btrfs/125               [14:07:58][ 9038.483946] run fstests btrfs/125 at 2020-02-18 14:07:58
[ 9038.894197] BTRFS info (device vda): disk space caching is enabled
[ 9038.897644] BTRFS info (device vda): has skinny extents
[ 9039.553603] BTRFS: device fsid 853f4928-5113-48fe-b4e5-b2d1f1af15bf devid 1 transid 5 /dev/vdb scanned by mkfs.btrfs (29881)
[ 9039.557424] BTRFS: device fsid 853f4928-5113-48fe-b4e5-b2d1f1af15bf devid 2 transid 5 /dev/vdc scanned by mkfs.btrfs (29881)
[ 9039.561587] BTRFS: device fsid 853f4928-5113-48fe-b4e5-b2d1f1af15bf devid 3 transid 5 /dev/vdd scanned by mkfs.btrfs (29881)
[ 9039.581158] BTRFS info (device vdb): turning on sync discard
[ 9039.584130] BTRFS info (device vdb): disk space caching is enabled
[ 9039.587021] BTRFS info (device vdb): has skinny extents
[ 9039.589100] BTRFS info (device vdb): flagging fs with big metadata feature
[ 9039.595079] BTRFS info (device vdb): checking UUID tree
[ 9039.706233] BTRFS: device fsid 853f4928-5113-48fe-b4e5-b2d1f1af15bf devid 2 transid 7 /dev/vdc scanned by mount (29923)
[ 9039.711721] BTRFS: device fsid 853f4928-5113-48fe-b4e5-b2d1f1af15bf devid 1 transid 7 /dev/vdb scanned by mount (29923)
[ 9039.715946] BTRFS info (device vdb): allowing degraded mounts
[ 9039.717785] BTRFS info (device vdb): disk space caching is enabled
[ 9039.719914] BTRFS info (device vdb): has skinny extents
[ 9039.723260] BTRFS warning (device vdb): devid 3 uuid a55e3334-e24b-4220-93a2-9aaec7286042 is missing
[ 9039.726785] BTRFS warning (device vdb): devid 3 uuid a55e3334-e24b-4220-93a2-9aaec7286042 is missing
[ 9040.189984] BTRFS: device fsid 4d1f43bd-2b49-4905-a219-25dc10f1b5fe devid 1 transid 249 /dev/vda scanned by btrfs (29951)
[ 9040.211252] BTRFS info (device vdb): turning on sync discard
[ 9040.223781] BTRFS info (device vdb): disk space caching is enabled
[ 9040.225777] BTRFS info (device vdb): has skinny extents
[ 9050.252423] BTRFS info (device vdb): balance: start -d -m -s
[ 9050.256160] BTRFS info (device vdb): relocating block group 217710592 flags data|raid5
[ 9050.389701] btrfs_print_data_csum_error: 6540 callbacks suppressed
[ 9050.389706] BTRFS warning (device vdb): csum failed root -9 ino 257 off 5357568 csum 0xb5d704296aa5f2d5 expected csum 0xdbbbd8326f9b86ac mirror 1
[ 9050.398381] BTRFS warning (device vdb): csum failed root -9 ino 257 off 5361664 csum 0xf2f7aa2e723d427e expected csum 0xdbbbd8326f9b86ac mirror 1
[ 9050.398899] BTRFS warning (device vdb): csum failed root -9 ino 257 off 3424256 csum 0x15e7d27c0f4b4f20 expected csum 0xdbbbd8326f9b86ac mirror 1
[ 9050.403918] BTRFS warning (device vdb): csum failed root -9 ino 257 off 5365760 csum 0x25464d4ee302ec50 expected csum 0xdbbbd8326f9b86ac mirror 1
[ 9050.408525] BTRFS warning (device vdb): csum failed root -9 ino 257 off 3428352 csum 0xcd1278f31a795b3d expected csum 0xdbbbd8326f9b86ac mirror 1
[ 9050.412556] BTRFS warning (device vdb): csum failed root -9 ino 257 off 5369856 csum 0x045d00c41342cf48 expected csum 0xdbbbd8326f9b86ac mirror 1
[ 9050.417133] BTRFS warning (device vdb): csum failed root -9 ino 257 off 3432448 csum 0xab3b271dc78f0131 expected csum 0xdbbbd8326f9b86ac mirror 1
[ 9050.425859] BTRFS warning (device vdb): csum failed root -9 ino 257 off 3436544 csum 0xa1e70b1d4d61d53f expected csum 0xdbbbd8326f9b86ac mirror 1
[ 9050.430432] repair_io_failure: 9 callbacks suppressed
[ 9050.430435] BTRFS info (device vdb): read error corrected: ino 257 off 3424256 (dev /dev/vdd sector 196512)
[ 9050.430464] BTRFS info (device vdb): read error corrected: ino 257 off 3428352 (dev /dev/vdd sector 196520)
[ 9050.430473] BTRFS info (device vdb): read error corrected: ino 257 off 5361664 (dev /dev/vdd sector 198376)
[ 9050.430681] BTRFS info (device vdb): read error corrected: ino 257 off 3432448 (dev /dev/vdd sector 196528)
[ 9050.430762] BTRFS info (device vdb): read error corrected: ino 257 off 5357568 (dev /dev/vdd sector 198368)
[ 9050.430805] BTRFS info (device vdb): read error corrected: ino 257 off 5369856 (dev /dev/vdd sector 198392)
[ 9050.430853] BTRFS info (device vdb): read error corrected: ino 257 off 3436544 (dev /dev/vdd sector 196536)
[ 9050.432557] BTRFS info (device vdb): read error corrected: ino 257 off 5365760 (dev /dev/vdd sector 198384)
[ 9050.469621] BTRFS warning (device vdb): csum failed root -9 ino 257 off 13959168 csum 0x774317b489e76165 expected csum 0xdbbbd8326f9b86ac mirror 1
[ 9050.476527] BTRFS info (device vdb): read error corrected: ino 257 off 13959168 (dev /dev/vdd sector 206720)
[ 9050.501930] BTRFS warning (device vdb): csum failed root -9 ino 257 off 16351232 csum 0xa735bf811aa10b77 expected csum 0xdbbbd8326f9b86ac mirror 1
[ 9050.512917] BTRFS info (device vdb): read error corrected: ino 257 off 17608704 (dev /dev/vdd sector 210392)
[ 9050.708681] BTRFS error (device vdb): bad tree block start, want 39059456 have 0
[ 9050.708684] BTRFS error (device vdb): bad tree block start, want 39075840 have 0
[ 9050.708699] BTRFS error (device vdb): bad tree block start, want 39092224 have 0
[ 9050.708700] BTRFS error (device vdb): bad tree block start, want 39108608 have 0
[ 9050.713287] BTRFS error (device vdb): bad tree block start, want 39059456 have 0
[ 9050.727830] BTRFS error (device vdb): bad tree block start, want 39059456 have 0
[ 9050.727849] BTRFS error (device vdb): bad tree block start, want 39075840 have 0
[ 9050.727854] BTRFS error (device vdb): bad tree block start, want 39092224 have 0
[ 9050.727876] BTRFS error (device vdb): bad tree block start, want 39108608 have 0
[ 9050.730601] BTRFS error (device vdb): bad tree block start, want 39059456 have 0
[ 9053.133149] BTRFS info (device vdb): balance: ended with status: -5
[failed, exit status 1][ 9053.200709] BTRFS info (device vdb): clearing incompat feature flag for RAID56 (0x80)
 [14:08:12]- output mismatch (see /tmp/fstests/results//btrfs/125.out.bad)
    --- tests/btrfs/125.out     2018-04-12 16:57:00.616225550 +0000
    +++ /tmp/fstests/results//btrfs/125.out.bad 2020-02-18 14:08:12.856000000 +0000
    @@ -3,5 +3,5 @@
     Write data with degraded mount
     
     Mount normal and balance
    -
    -Mount degraded but with other dev
    +failed: '/sbin/btrfs balance start /tmp/scratch'
    +(see /tmp/fstests/results//btrfs/125.full for details)
    ...
    (Run 'diff -u /tmp/fstests/tests/btrfs/125.out /tmp/fstests/results//btrfs/125.out.bad'  to see the entire diff)
---
Johannes Thumshirn Feb. 25, 2020, 12:23 a.m. UTC | #4
On 24/02/2020 09:50, David Sterba wrote:
[...]
> So it was test btrfs/125 but I can't say for sure which patchset causes
> it, there are several reports but unlikely to be caused by yours.
> 
> btrfs/125               [14:07:58][ 9038.483946] run fstests btrfs/125 at 2020-02-18 14:07:58
> [ 9038.894197] BTRFS info (device vda): disk space caching is enabled
> [ 9038.897644] BTRFS info (device vda): has skinny extents
> [ 9039.553603] BTRFS: device fsid 853f4928-5113-48fe-b4e5-b2d1f1af15bf devid 1 transid 5 /dev/vdb scanned by mkfs.btrfs (29881)
> [ 9039.557424] BTRFS: device fsid 853f4928-5113-48fe-b4e5-b2d1f1af15bf devid 2 transid 5 /dev/vdc scanned by mkfs.btrfs (29881)
> [ 9039.561587] BTRFS: device fsid 853f4928-5113-48fe-b4e5-b2d1f1af15bf devid 3 transid 5 /dev/vdd scanned by mkfs.btrfs (29881)
> [ 9039.581158] BTRFS info (device vdb): turning on sync discard
> [ 9039.584130] BTRFS info (device vdb): disk space caching is enabled
> [ 9039.587021] BTRFS info (device vdb): has skinny extents
> [ 9039.589100] BTRFS info (device vdb): flagging fs with big metadata feature
> [ 9039.595079] BTRFS info (device vdb): checking UUID tree
> [ 9039.706233] BTRFS: device fsid 853f4928-5113-48fe-b4e5-b2d1f1af15bf devid 2 transid 7 /dev/vdc scanned by mount (29923)
> [ 9039.711721] BTRFS: device fsid 853f4928-5113-48fe-b4e5-b2d1f1af15bf devid 1 transid 7 /dev/vdb scanned by mount (29923)
> [ 9039.715946] BTRFS info (device vdb): allowing degraded mounts
> [ 9039.717785] BTRFS info (device vdb): disk space caching is enabled
> [ 9039.719914] BTRFS info (device vdb): has skinny extents
> [ 9039.723260] BTRFS warning (device vdb): devid 3 uuid a55e3334-e24b-4220-93a2-9aaec7286042 is missing
> [ 9039.726785] BTRFS warning (device vdb): devid 3 uuid a55e3334-e24b-4220-93a2-9aaec7286042 is missing
> [ 9040.189984] BTRFS: device fsid 4d1f43bd-2b49-4905-a219-25dc10f1b5fe devid 1 transid 249 /dev/vda scanned by btrfs (29951)
> [ 9040.211252] BTRFS info (device vdb): turning on sync discard
> [ 9040.223781] BTRFS info (device vdb): disk space caching is enabled
> [ 9040.225777] BTRFS info (device vdb): has skinny extents
> [ 9050.252423] BTRFS info (device vdb): balance: start -d -m -s
> [ 9050.256160] BTRFS info (device vdb): relocating block group 217710592 flags data|raid5
> [ 9050.389701] btrfs_print_data_csum_error: 6540 callbacks suppressed
> [ 9050.389706] BTRFS warning (device vdb): csum failed root -9 ino 257 off 5357568 csum 0xb5d704296aa5f2d5 expected csum 0xdbbbd8326f9b86ac mirror 1
> [ 9050.398381] BTRFS warning (device vdb): csum failed root -9 ino 257 off 5361664 csum 0xf2f7aa2e723d427e expected csum 0xdbbbd8326f9b86ac mirror 1
> [ 9050.398899] BTRFS warning (device vdb): csum failed root -9 ino 257 off 3424256 csum 0x15e7d27c0f4b4f20 expected csum 0xdbbbd8326f9b86ac mirror 1
> [ 9050.403918] BTRFS warning (device vdb): csum failed root -9 ino 257 off 5365760 csum 0x25464d4ee302ec50 expected csum 0xdbbbd8326f9b86ac mirror 1
> [ 9050.408525] BTRFS warning (device vdb): csum failed root -9 ino 257 off 3428352 csum 0xcd1278f31a795b3d expected csum 0xdbbbd8326f9b86ac mirror 1
> [ 9050.412556] BTRFS warning (device vdb): csum failed root -9 ino 257 off 5369856 csum 0x045d00c41342cf48 expected csum 0xdbbbd8326f9b86ac mirror 1
> [ 9050.417133] BTRFS warning (device vdb): csum failed root -9 ino 257 off 3432448 csum 0xab3b271dc78f0131 expected csum 0xdbbbd8326f9b86ac mirror 1
> [ 9050.425859] BTRFS warning (device vdb): csum failed root -9 ino 257 off 3436544 csum 0xa1e70b1d4d61d53f expected csum 0xdbbbd8326f9b86ac mirror 1
> [ 9050.430432] repair_io_failure: 9 callbacks suppressed
> [ 9050.430435] BTRFS info (device vdb): read error corrected: ino 257 off 3424256 (dev /dev/vdd sector 196512)
> [ 9050.430464] BTRFS info (device vdb): read error corrected: ino 257 off 3428352 (dev /dev/vdd sector 196520)
> [ 9050.430473] BTRFS info (device vdb): read error corrected: ino 257 off 5361664 (dev /dev/vdd sector 198376)
> [ 9050.430681] BTRFS info (device vdb): read error corrected: ino 257 off 3432448 (dev /dev/vdd sector 196528)
> [ 9050.430762] BTRFS info (device vdb): read error corrected: ino 257 off 5357568 (dev /dev/vdd sector 198368)
> [ 9050.430805] BTRFS info (device vdb): read error corrected: ino 257 off 5369856 (dev /dev/vdd sector 198392)
> [ 9050.430853] BTRFS info (device vdb): read error corrected: ino 257 off 3436544 (dev /dev/vdd sector 196536)
> [ 9050.432557] BTRFS info (device vdb): read error corrected: ino 257 off 5365760 (dev /dev/vdd sector 198384)
> [ 9050.469621] BTRFS warning (device vdb): csum failed root -9 ino 257 off 13959168 csum 0x774317b489e76165 expected csum 0xdbbbd8326f9b86ac mirror 1
> [ 9050.476527] BTRFS info (device vdb): read error corrected: ino 257 off 13959168 (dev /dev/vdd sector 206720)
> [ 9050.501930] BTRFS warning (device vdb): csum failed root -9 ino 257 off 16351232 csum 0xa735bf811aa10b77 expected csum 0xdbbbd8326f9b86ac mirror 1
> [ 9050.512917] BTRFS info (device vdb): read error corrected: ino 257 off 17608704 (dev /dev/vdd sector 210392)
> [ 9050.708681] BTRFS error (device vdb): bad tree block start, want 39059456 have 0
> [ 9050.708684] BTRFS error (device vdb): bad tree block start, want 39075840 have 0
> [ 9050.708699] BTRFS error (device vdb): bad tree block start, want 39092224 have 0
> [ 9050.708700] BTRFS error (device vdb): bad tree block start, want 39108608 have 0
> [ 9050.713287] BTRFS error (device vdb): bad tree block start, want 39059456 have 0
> [ 9050.727830] BTRFS error (device vdb): bad tree block start, want 39059456 have 0
> [ 9050.727849] BTRFS error (device vdb): bad tree block start, want 39075840 have 0
> [ 9050.727854] BTRFS error (device vdb): bad tree block start, want 39092224 have 0
> [ 9050.727876] BTRFS error (device vdb): bad tree block start, want 39108608 have 0
> [ 9050.730601] BTRFS error (device vdb): bad tree block start, want 39059456 have 0
> [ 9053.133149] BTRFS info (device vdb): balance: ended with status: -5
> [failed, exit status 1][ 9053.200709] BTRFS info (device vdb): clearing incompat feature flag for RAID56 (0x80)
>   [14:08:12]- output mismatch (see /tmp/fstests/results//btrfs/125.out.bad)
>      --- tests/btrfs/125.out     2018-04-12 16:57:00.616225550 +0000
>      +++ /tmp/fstests/results//btrfs/125.out.bad 2020-02-18 14:08:12.856000000 +0000
>      @@ -3,5 +3,5 @@
>       Write data with degraded mount
>       
>       Mount normal and balance
>      -
>      -Mount degraded but with other dev
>      +failed: '/sbin/btrfs balance start /tmp/scratch'
>      +(see /tmp/fstests/results//btrfs/125.full for details)
>      ...
>      (Run 'diff -u /tmp/fstests/tests/btrfs/125.out /tmp/fstests/results//btrfs/125.out.bad'  to see the entire diff)

OK it need some time (up to approx 150 runs of btrfs/125) to resproduce 
this failure but I can re-create it.

I've seen it once without the patchset applied as well though (a.k.a 
HEAD == 480be04e7fdcddfed86fd59bb668a134b6d7393e).

I have to verify that though not that I'm seeing ghosts.
Johannes Thumshirn Feb. 25, 2020, 12:56 a.m. UTC | #5
On 24/02/2020 16:23, Johannes Thumshirn wrote:
[...]
> OK it need some time (up to approx 150 runs of btrfs/125) to resproduce
> this failure but I can re-create it.
> 
> I've seen it once without the patchset applied as well though (a.k.a
> HEAD == 480be04e7fdcddfed86fd59bb668a134b6d7393e).
> 
> I have to verify that though not that I'm seeing ghosts.
> 

OK, so I've hit it again with the above commit as HEAD, so I don't think 
these patches participate in the error.

I'll try to figure out which patch does though.

Byte,
	Johannes