
"Bad page state" errors when calling BULKSTAT under memory pressure?

Message ID 20160824234237.GA22760@birch.djwong.org (mailing list archive)
State Not Applicable

Commit Message

Darrick J. Wong Aug. 24, 2016, 11:42 p.m. UTC
Hi everyone,

[cc'ing Brian because he was the last one to touch xfs_buf.c]

I've been stress-testing xfs_scrub against a 900GB filesystem with 2M inodes
using a VM with 512M of RAM.  I've noticed that I get BUG messages about
pages with negative refcount, but only if the system is under memory pressure.
No errors are seen if the VM memory is increased to, say, 20GB.
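
For reference, the BULKSTAT calls in question are just XFS_IOC_FSBULKSTAT
ioctls issued in a loop by xfs_scrub.  A minimal userspace sketch of such a
loop -- assuming the xfsprogs headers, and not the actual xfs_scrub code --
looks roughly like this:

#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <xfs/xfs.h>

/* illustrative only: walk every inode in the filesystem via bulkstat.
 * fd is any open fd on the XFS filesystem. */
static int bulkstat_all(int fd)
{
	struct xfs_fsop_bulkreq	req;
	struct xfs_bstat	buf[1024];
	__u64			lastino = 0;
	__s32			count = 0;

	memset(&req, 0, sizeof(req));
	req.lastip = &lastino;		/* cursor, advanced by the kernel */
	req.count = 1024;		/* max inodes per call */
	req.ubuffer = buf;
	req.ocount = &count;

	/* each call walks inode chunks and starts readahead on them */
	while (ioctl(fd, XFS_IOC_FSBULKSTAT, &req) == 0 && count > 0)
		printf("got %d inodes, last ino %llu\n", count,
		       (unsigned long long)lastino);
	return 0;
}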

: BUG: Bad page state in process xfs_scrub  pfn:00426
: page:ffffea0000010980 count:-1 mapcount:0 mapping:          (null) index:0x0
: flags: 0x0()
: page dumped because: nonzero _count
: Modules linked in: xfs libcrc32c sch_fq_codel af_packet
: CPU: 1 PID: 2058 Comm: xfs_scrub Not tainted 4.8.0-rc3-mcsum #18
: Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
:  0000000000000000 ffff906f9c4237b0 ffffffff98331bf3 ffffea0000010980
:  ffffffff989d923c ffff906f9c4237d8 ffffffff9819c5b4 0000000000000000
:  0000000000000000 0000000000000000 ffff906f9c4237e8 ffffffff9819c701
: Call Trace:
:  [<ffffffff98331bf3>] dump_stack+0x85/0xc2
:  [<ffffffff9819c5b4>] bad_page+0xc4/0x130
:  [<ffffffff9819c701>] check_new_page_bad+0x71/0x80
:  [<ffffffff9819fd97>] get_page_from_freelist+0x907/0xa60
:  [<ffffffff981a05a6>] __alloc_pages_nodemask+0x156/0xf70
:  [<ffffffff9812812f>] ? is_module_address+0x2f/0x50
:  [<ffffffffc03ca746>] xfs_buf_allocate_memory+0x1f6/0x2f0 [xfs]
:  [<ffffffffc03cc49d>] xfs_buf_get_map+0x1bd/0x400 [xfs]
:  [<ffffffffc03cdb2c>] xfs_buf_read_map+0x2c/0x360 [xfs]
:  [<ffffffff98307767>] ? queue_unplugged+0x57/0x190
:  [<ffffffffc03cdeb2>] xfs_buf_readahead_map+0x52/0x70 [xfs]
:  [<ffffffffc037fdfd>] xfs_btree_reada_bufs+0x9d/0xb0 [xfs]
:  [<ffffffffc03eea6a>] xfs_bulkstat_ichunk_ra+0xda/0x110 [xfs]
:  [<ffffffffc037ef14>] ? xfs_btree_check_block+0x14/0x20 [xfs]
:  [<ffffffffc03eeec8>] xfs_bulkstat+0x2b8/0x6c0 [xfs]
:  [<ffffffffc03ee960>] ? xfs_bulkstat_one_int+0x430/0x430 [xfs]
:  [<ffffffffc03e0826>] xfs_ioc_bulkstat+0x116/0x210 [xfs]
:  [<ffffffffc03e2631>] xfs_file_ioctl+0x591/0xf10 [xfs]
:  [<ffffffff98045b03>] ? kvm_clock_read+0x23/0x40
:  [<ffffffff98045b39>] ? kvm_sched_clock_read+0x9/0x20
:  [<ffffffff98022729>] ? sched_clock+0x9/0x10
:  [<ffffffff980dddcd>] ? __lock_acquire+0x25d/0x14b0
:  [<ffffffff98234993>] ? __fget+0xf3/0x200
:  [<ffffffff98234993>] ? __fget+0xf3/0x200
:  [<ffffffff98350277>] ? debug_smp_processor_id+0x17/0x20
:  [<ffffffff980d9a19>] ? get_lock_stats+0x19/0x60
:  [<ffffffff982276d3>] do_vfs_ioctl+0x93/0x700
:  [<ffffffff982349b0>] ? __fget+0x110/0x200
:  [<ffffffff982348a5>] ? __fget+0x5/0x200
:  [<ffffffff98227d81>] SyS_ioctl+0x41/0x70
:  [<ffffffff9866d6bc>] entry_SYSCALL_64_fastpath+0x1f/0xbd
: Disabling lock debugging due to kernel taint

Obviously, a page refcount of -1 is not a good sign.  I had a hunch that
the page in question was (hopefully) a page backing an xfs_buf, so I
applied the following debug patch:

diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index 607cc29..144b976 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -317,7 +317,7 @@ xfs_buf_free(
 
                for (i = 0; i < bp->b_page_count; i++) {
                        struct page     *page = bp->b_pages[i];
-
+if (page_ref_count(page) != 1) {xfs_err(NULL, "%s: OHNO! daddr=%llu page=%p ref=%d", __func__, bp->b_bn, page, page_ref_count(page)); dump_stack();}
                        __free_page(page);
                }
        } else if (bp->b_flags & _XBF_KMEM)

I then saw this:

: SGI XFS with ACLs, security attributes, realtime, debug enabled
: XFS (sda): Mounting V4 Filesystem
: XFS (sda): Ending clean mount
: XFS: xfs_buf_free: OHNO! daddr=113849120 page=ffffea0000010980 ref=0
: CPU: 2 PID: 2058 Comm: xfs_scrub Not tainted 4.8.0-rc3-mcsum #18
: Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
:  0000000000000000 ffff906f9c4239f8 ffffffff98331bf3 ffff906f95e86b00
:  0000000000000000 ffff906f9c423a20 ffffffffc03ca53e ffff906f9c604900
:  0000000000000000 0000000000010015 ffff906f9c423a60 ffffffffc03cc640
: Call Trace:
:  [<ffffffff98331bf3>] dump_stack+0x85/0xc2
:  [<ffffffffc03ca53e>] xfs_buf_free+0x27e/0x290 [xfs]
:  [<ffffffffc03cc640>] xfs_buf_get_map+0x360/0x400 [xfs]
:  [<ffffffffc03cdb2c>] xfs_buf_read_map+0x2c/0x360 [xfs]
:  [<ffffffffc03cdeb2>] ? xfs_buf_readahead_map+0x52/0x70 [xfs]
:  [<ffffffffc03cdeb2>] xfs_buf_readahead_map+0x52/0x70 [xfs]
:  [<ffffffffc037fdfd>] xfs_btree_reada_bufs+0x9d/0xb0 [xfs]
:  [<ffffffffc03eea6a>] xfs_bulkstat_ichunk_ra+0xda/0x110 [xfs]
:  [<ffffffffc037ef14>] ? xfs_btree_check_block+0x14/0x20 [xfs]
:  [<ffffffffc03eeec8>] xfs_bulkstat+0x2b8/0x6c0 [xfs]
:  [<ffffffffc03ee960>] ? xfs_bulkstat_one_int+0x430/0x430 [xfs]
:  [<ffffffffc03e0826>] xfs_ioc_bulkstat+0x116/0x210 [xfs]
:  [<ffffffffc03e2631>] xfs_file_ioctl+0x591/0xf10 [xfs]
:  [<ffffffff98045b03>] ? kvm_clock_read+0x23/0x40
:  [<ffffffff98045b39>] ? kvm_sched_clock_read+0x9/0x20
:  [<ffffffff98022729>] ? sched_clock+0x9/0x10
:  [<ffffffff980dddcd>] ? __lock_acquire+0x25d/0x14b0
:  [<ffffffff98234993>] ? __fget+0xf3/0x200
:  [<ffffffff98234993>] ? __fget+0xf3/0x200
:  [<ffffffff98350277>] ? debug_smp_processor_id+0x17/0x20
:  [<ffffffff980d9a19>] ? get_lock_stats+0x19/0x60
:  [<ffffffff982276d3>] do_vfs_ioctl+0x93/0x700
:  [<ffffffff982349b0>] ? __fget+0x110/0x200
:  [<ffffffff982348a5>] ? __fget+0x5/0x200
:  [<ffffffff98227d81>] SyS_ioctl+0x41/0x70
:  [<ffffffff9866d6bc>] entry_SYSCALL_64_fastpath+0x1f/0xbd

...followed several seconds later by the BUG message seen earlier.
It looks as though the bulkstat call is attempting to start readahead
of the inode chunk, but I'm not familiar enough with how RA works to
readily tell what's going on here, or why it's freeing an xfs_buf as
part of starting RA.  The sector 113849120 does indeed contain an
inode.
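
(For reference, a rough paraphrase of the xfs_buf_get_map() path that shows
up in the second trace, based on 4.8-era fs/xfs/xfs_buf.c -- not a verbatim
copy, so take it with a grain of salt.  A buffer whose pages were just
allocated gets freed again if the repeated lookup turns up an existing
buffer for the same range:)

	bp = _xfs_buf_find(target, map, nmaps, flags, NULL);
	if (bp)
		goto found;

	new_bp = _xfs_buf_alloc(target, map, nmaps, flags);
	if (!new_bp)
		return NULL;

	/* page allocation seen in the first trace */
	error = xfs_buf_allocate_memory(new_bp, flags);
	if (error) {
		xfs_buf_free(new_bp);
		return NULL;
	}

	/* retry the lookup, inserting new_bp if it is still missing */
	bp = _xfs_buf_find(target, map, nmaps, flags, new_bp);
	if (!bp) {
		xfs_buf_free(new_bp);
		return NULL;
	}

	if (bp != new_bp)
		xfs_buf_free(new_bp);	/* someone else won the race */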

# xfs_info /mnt
meta-data=/dev/sda               isize=512    agcount=4, agsize=56308224 blks
         =                       sectsz=4096  attr=2, projid32bit=1
         =                       crc=0        finobt=0 spinodes=0 rmapbt=0
         =                       reflink=0
data     =                       bsize=4096   blocks=225232896, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0 ftype=0
log      =internal               bsize=4096   blocks=109977, version=2
         =                       sectsz=4096  sunit=1 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
# df -i /mnt ; df -k /mnt
Filesystem     Inodes IUsed IFree IUse% Mounted on
/dev/sda         430M  2.0M  428M    1% /mnt
Filesystem     1K-blocks      Used Available Use% Mounted on
/dev/sda       900491676 326109388 574382288  37% /mnt

Kernel is 4.8-rc3 + Darrick's giant patchset, though I can reproduce
it with plain 4.8-rc3 if I beat on bulkstat somewhat harder.  Not
sure what the fix is here, though disabling inode chunk readahead
for bulkstat does make the BUG go away. :$
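
(For reference, the readahead being disabled is the per-cluster loop in
xfs_bulkstat_ichunk_ra(); roughly, paraphrasing 4.8's fs/xfs/xfs_itable.c
rather than quoting it verbatim:)

	agbno = XFS_AGINO_TO_AGBNO(mp, irec->ir_startino);

	blk_start_plug(&plug);
	for (i = 0; i < XFS_INODES_PER_CHUNK;
	     i += inodes_per_cluster, agbno += blks_per_cluster) {
		/* only read ahead clusters holding allocated inodes */
		if (xfs_inobt_maskn(i, inodes_per_cluster) & ~irec->ir_free)
			xfs_btree_reada_bufs(mp, agno, agbno,
					     blks_per_cluster,
					     &xfs_inode_buf_ops);
	}
	blk_finish_plug(&plug);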

--D

Comments

Dave Chinner Aug. 25, 2016, 12:35 a.m. UTC | #1
On Wed, Aug 24, 2016 at 04:42:37PM -0700, Darrick J. Wong wrote:
> Hi everyone,
> 
> [cc'ing Brian because he was the last one to touch xfs_buf.c]
> 
> I've been stress-testing xfs_scrub against a 900GB filesystem with 2M inodes
> using a VM with 512M of RAM.  I've noticed that I get BUG messages about
> pages with negative refcount, but only if the system is under memory pressure.
> No errors are seen if the VM memory is increased to, say, 20GB.
> 
> : BUG: Bad page state in process xfs_scrub  pfn:00426
> : page:ffffea0000010980 count:-1 mapcount:0 mapping:          (null) index:0x0
> : flags: 0x0()
> : page dumped because: nonzero _count

Unless we are double-freeing a buffer, that's not an XFS problem.
Have you tried with memory poisoning and allocation debug turned on?

> : Modules linked in: xfs libcrc32c sch_fq_codel af_packet
> : CPU: 1 PID: 2058 Comm: xfs_scrub Not tainted 4.8.0-rc3-mcsum #18

The mm architecture was significantly modified in 4.8.0-rc1 - it
went from per-zone to per-node infrastructure, so it's entirely
possible this is a memory reclaim regression. Can you reproduce it
on an older kernel (e.g. 4.7.0)?

> Obviously, a page refcount of -1 is not a good sign.  I had a hunch that
> the page in question was (hopefully) a page backing an xfs_buf, so I
> applied the following debug patch:
> 
> diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> index 607cc29..144b976 100644
> --- a/fs/xfs/xfs_buf.c
> +++ b/fs/xfs/xfs_buf.c
> @@ -317,7 +317,7 @@ xfs_buf_free(
>  
>                 for (i = 0; i < bp->b_page_count; i++) {
>                         struct page     *page = bp->b_pages[i];
> -
> +if (page_ref_count(page) != 1) {xfs_err(NULL, "%s: OHNO! daddr=%llu page=%p ref=%d", __func__, bp->b_bn, page, page_ref_count(page)); dump_stack();}
>                         __free_page(page);
>                 }
>         } else if (bp->b_flags & _XBF_KMEM)
> 
> I then saw this:
> 
> : SGI XFS with ACLs, security attributes, realtime, debug enabled
> : XFS (sda): Mounting V4 Filesystem
> : XFS (sda): Ending clean mount
> : XFS: xfs_buf_free: OHNO! daddr=113849120 page=ffffea0000010980 ref=0

Which implies something else has dropped the page reference count on
us while we hold a reference to it. What you might like to check is
what the page reference counts are on /allocation/, to see if we're
being handed a page from the freelist with a bad ref count....

If the ref counts are good at allocation, but bad on free, then I
very much doubt it's an XFS problem. We don't actually touch the
page reference count anywhere, so let's make sure that it's not a
double free or something like that in XFS first.
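
(As an illustration of that allocation-side check: something along these
lines in the page-allocation loop of xfs_buf_allocate_memory(), mirroring
the check added on the free side -- untested, and paraphrasing the
surrounding code from memory:)

	for (i = 0; i < bp->b_page_count; i++) {
		struct page	*page = alloc_page(gfp_mask);

		if (!page)
			return -ENOMEM;	/* real code retries with congestion_wait */

		/* a page fresh from the allocator should have a refcount of 1 */
		if (page_ref_count(page) != 1)
			xfs_err(NULL, "%s: bad refcount at alloc page=%p ref=%d",
				__func__, page, page_ref_count(page));

		bp->b_pages[i] = page;
	}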

Cheers,

Dave.
Darrick J. Wong Aug. 25, 2016, 12:48 a.m. UTC | #2
On Thu, Aug 25, 2016 at 10:35:48AM +1000, Dave Chinner wrote:
> On Wed, Aug 24, 2016 at 04:42:37PM -0700, Darrick J. Wong wrote:
> > Hi everyone,
> > 
> > [cc'ing Brian because he was the last one to touch xfs_buf.c]
> > 
> > I've been stress-testing xfs_scrub against a 900GB filesystem with 2M inodes
> > using a VM with 512M of RAM.  I've noticed that I get BUG messages about
> > pages with negative refcount, but only if the system is under memory pressure.
> > No errors are seen if the VM memory is increased to, say, 20GB.
> > 
> > : BUG: Bad page state in process xfs_scrub  pfn:00426
> > : page:ffffea0000010980 count:-1 mapcount:0 mapping:          (null) index:0x0
> > : flags: 0x0()
> > : page dumped because: nonzero _count
> 
> Unless we are double-freeing a buffer, that's not an XFS problem.
> Have you tried with memory poisoning and allocation debug turned on?

Yes.  The BUG did not reproduce, though it did take nearly 35min to run
scrub (which usually takes ~2min).

> > : Modules linked in: xfs libcrc32c sch_fq_codel af_packet
> > : CPU: 1 PID: 2058 Comm: xfs_scrub Not tainted 4.8.0-rc3-mcsum #18
> 
> The mm architecture was significantly modified in 4.8.0-rc1 - it
> went from per-zone to per-node infrastructure, so it's entirely
> possible this is a memory reclaim regression. Can you reproduce it
> on an older kernel (e.g. 4.7.0)?

I'll try.  I noticed that it's easier to make it happen when scrub is
using getfsmap and/or the new in-kernel scrubbers, but that's no big
surprise since that means we're pounding harder on the metadata. :)

> > Obviously, a page refcount of -1 is not a good sign.  I had a hunch that
> > the page in question was (hopefully) a page backing an xfs_buf, so I
> > applied the following debug patch:
> > 
> > diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> > index 607cc29..144b976 100644
> > --- a/fs/xfs/xfs_buf.c
> > +++ b/fs/xfs/xfs_buf.c
> > @@ -317,7 +317,7 @@ xfs_buf_free(
> >  
> >                 for (i = 0; i < bp->b_page_count; i++) {
> >                         struct page     *page = bp->b_pages[i];
> > -
> > +if (page_ref_count(page) != 1) {xfs_err(NULL, "%s: OHNO! daddr=%llu page=%p ref=%d", __func__, bp->b_bn, page, page_ref_count(page)); dump_stack();}
> >                         __free_page(page);
> >                 }
> >         } else if (bp->b_flags & _XBF_KMEM)
> > 
> > I then saw this:
> > 
> > : SGI XFS with ACLs, security attributes, realtime, debug enabled
> > : XFS (sda): Mounting V4 Filesystem
> > : XFS (sda): Ending clean mount
> > : XFS: xfs_buf_free: OHNO! daddr=113849120 page=ffffea0000010980 ref=0
> 
> Which implies something else has dropped the page reference count on
> us while we hold a reference to it. What you might like to check is
> what the page reference counts are on /allocation/, to see if we're
> being handed a page from the freelist with a bad ref count....

Zero on allocation, except when we hit the BUG case.

> If the ref counts are good at allocation, but bad on free, then I
> very much doubt it's an XFS problem. We don't actually touch the
> page reference count anywhere, so let's make sure that it's not a
> double free or something like that in XFS first.

I couldn't find any smoking gun inside XFS, which is why I went to
the list -- I figured something must be doing something I don't know
about. :)

Anyway, I was going to push out the reflink patches for review, but
the scrubber crashing held me up.  Tomorrow, probably. :/

--D

> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

Patch

diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index 607cc29..144b976 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -317,7 +317,7 @@ xfs_buf_free(
 
                for (i = 0; i < bp->b_page_count; i++) {
                        struct page     *page = bp->b_pages[i];
-
+if (page_ref_count(page) != 1) {xfs_err(NULL, "%s: OHNO! daddr=%llu page=%p ref=%d", __func__, bp->b_bn, page, page_ref_count(page)); dump_stack();}
                        __free_page(page);
                }
        } else if (bp->b_flags & _XBF_KMEM)