dax: Fix deadlock in dax_lock_mapping_entry()

Message ID	20180927112332.3649-1-jack@suse.cz (mailing list archive)
State	Accepted
Commit	f52afc93cd018fe6910133a05d44671192d1aeb0
Headers	From: Jan Kara <jack@suse.cz>
	To: Dan Williams <dan.j.williams@intel.com>
	Cc: linux-fsdevel@vger.kernel.org, Barret Rhoden <brho@google.com>, Jan Kara <jack@suse.cz>, linux-nvdimm@lists.01.org
	Subject: [PATCH] dax: Fix deadlock in dax_lock_mapping_entry()
	Date: Thu, 27 Sep 2018 13:23:32 +0200
	Message-Id: <20180927112332.3649-1-jack@suse.cz>
Series	dax: Fix deadlock in dax_lock_mapping_entry()

Jan Kara Sept. 27, 2018, 11:23 a.m. UTC

When dax_lock_mapping_entry() has to sleep to obtain the entry lock, it
fails to unlock the mapping->i_pages spinlock and thus immediately deadlocks
against itself when it retries grabbing the entry lock. Fix the problem by
unlocking mapping->i_pages before retrying.

Fixes: c2a7d2a115525d3501d38e23d24875a79a07e15e
Reported-by: Barret Rhoden <brho@google.com>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/dax.c | 1 +
 1 file changed, 1 insertion(+)

Dan, can you please get this merged? Otherwise dax_lock_mapping_entry()
deadlocks as soon as there's any contention.
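
The failure mode described above can be illustrated with a small userspace
model. This is only a sketch under stated assumptions, not the fs/dax.c code:
toy_lock, toy_lock_acquire(), toy_lock_release() and lock_entry() are
hypothetical names, and the "lock" is a flag that reports a second acquisition
instead of spinning forever, so the self-deadlock can be observed.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a non-recursive spinlock such as mapping->i_pages. */
struct toy_lock {
	bool held;
};

/* Returns false where a real spinlock would spin forever (self-deadlock). */
static bool toy_lock_acquire(struct toy_lock *l)
{
	if (l->held)
		return false;
	l->held = true;
	return true;
}

static void toy_lock_release(struct toy_lock *l)
{
	assert(l->held);
	l->held = false;
}

/*
 * Model of a lock-entry retry loop. The entry counts as contended for the
 * first `contended_rounds` attempts, which forces a retry each time.
 * Returns true if the entry was locked, false if the loop deadlocked
 * against itself.
 */
static bool lock_entry(struct toy_lock *l, int contended_rounds,
		       bool unlock_before_retry)
{
	for (;;) {
		if (!toy_lock_acquire(l))
			return false;	/* lock still held from the previous round */
		if (contended_rounds > 0) {
			contended_rounds--;
			if (unlock_before_retry)
				toy_lock_release(l);	/* the step the retry path needs */
			continue;
		}
		toy_lock_release(l);
		return true;
	}
}
```

In this model the uncontended path succeeds either way, which is consistent
with the bug only showing up once there is contention on the entry.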

Matthew Wilcox Sept. 27, 2018, 1:28 p.m. UTC | #1

On Thu, Sep 27, 2018 at 01:23:32PM +0200, Jan Kara wrote:
> When dax_lock_mapping_entry() has to sleep to obtain entry lock, it will
> fail to unlock mapping->i_pages spinlock and thus immediately deadlock
> against itself when retrying to grab the entry lock again. Fix the
> problem by unlocking mapping->i_pages before retrying.

It seems weird that xfstests doesn't provoke this ...

Jan Kara Sept. 27, 2018, 1:41 p.m. UTC | #2

On Thu 27-09-18 06:28:43, Matthew Wilcox wrote:
> On Thu, Sep 27, 2018 at 01:23:32PM +0200, Jan Kara wrote:
> > When dax_lock_mapping_entry() has to sleep to obtain entry lock, it will
> > fail to unlock mapping->i_pages spinlock and thus immediately deadlock
> > against itself when retrying to grab the entry lock again. Fix the
> > problem by unlocking mapping->i_pages before retrying.
> 
> It seems weird that xfstests doesn't provoke this ...

The function currently gets called only from mm/memory-failure.c. And yes,
we are lacking DAX hwpoison error tests in fstests...

								Honza

Dan Williams Sept. 27, 2018, 6:22 p.m. UTC | #3

On Thu, Sep 27, 2018 at 6:41 AM Jan Kara <jack@suse.cz> wrote:
>
> On Thu 27-09-18 06:28:43, Matthew Wilcox wrote:
> > On Thu, Sep 27, 2018 at 01:23:32PM +0200, Jan Kara wrote:
> > > When dax_lock_mapping_entry() has to sleep to obtain entry lock, it will
> > > fail to unlock mapping->i_pages spinlock and thus immediately deadlock
> > > against itself when retrying to grab the entry lock again. Fix the
> > > problem by unlocking mapping->i_pages before retrying.
> >
> > It seems weird that xfstests doesn't provoke this ...
>
> The function currently gets called only from mm/memory-failure.c. And yes,
> we are lacking DAX hwpoison error tests in fstests...

I have an item on my backlog to port the ndctl unit test that does
memory_failure() injection vs ext4 over to fstests. That said I've
been investigating a deadlock on ext4 caused by this test. When I saw
this patch I hoped it was root cause, but the test is still failing
for me. Vishal is able to pass the test on his system, so the failure
mode is timing dependent. I'm running this patch on top of -rc5 and
still seeing the following deadlock.

    EXT4-fs (pmem0): DAX enabled. Warning: EXPERIMENTAL, use at your own risk
    EXT4-fs (pmem0): mounted filesystem with ordered data mode. Opts: dax
    Injecting memory failure for pfn 0x208900 at process virtual address 0x7f5872900000
    Memory failure: 0x208900: Killing dax-pmd:7095 due to hardware memory corruption
    Memory failure: 0x208900: recovery action for dax page: Recovered
    watchdog: BUG: soft lockup - CPU#35 stuck for 22s! [dax-pmd:7095]
    [..]
    irq event stamp: 121911146
    hardirqs last  enabled at (121911145): [<ffffffff81aa1bd9>] _raw_spin_unlock_irq+0x29/0x40
    hardirqs last disabled at (121911146): [<ffffffff810037a3>] trace_hardirqs_off_thunk+0x1a/0x1c
    softirqs last  enabled at (78238674): [<ffffffff81e0032e>] __do_softirq+0x32e/0x428
    softirqs last disabled at (78238627): [<ffffffff810bc6f6>] irq_exit+0xf6/0x100
    CPU: 35 PID: 7095 Comm: dax-pmd Tainted: G           OE 4.19.0-rc5+ #2394
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.1-0-g0551a4be2c-prebuilt.qemu-project.org 04/01/2014
    RIP: 0010:lock_release+0x134/0x2a0
    [..]
    Call Trace:
     find_get_entries+0x299/0x3c0
     pagevec_lookup_entries+0x1a/0x30
     dax_layout_busy_page+0x9c/0x280
     ? __lock_acquire+0x12fa/0x1310
     ext4_break_layouts+0x48/0x100
     ? ext4_punch_hole+0x108/0x5a0
     ext4_punch_hole+0x110/0x5a0
     ext4_fallocate+0x189/0xa40
     ? rcu_read_lock_sched_held+0x6b/0x80
     ? rcu_sync_lockdep_assert+0x2e/0x60
     vfs_fallocate+0x13f/0x270

The same test against xfs is not failing for me. I have been seeking
some focus time to dig in on this.

Jan Kara Oct. 4, 2018, 4:27 p.m. UTC | #4

On Thu 27-09-18 11:22:22, Dan Williams wrote:
> On Thu, Sep 27, 2018 at 6:41 AM Jan Kara <jack@suse.cz> wrote:
> >
> > On Thu 27-09-18 06:28:43, Matthew Wilcox wrote:
> > > On Thu, Sep 27, 2018 at 01:23:32PM +0200, Jan Kara wrote:
> > > > When dax_lock_mapping_entry() has to sleep to obtain entry lock, it will
> > > > fail to unlock mapping->i_pages spinlock and thus immediately deadlock
> > > > against itself when retrying to grab the entry lock again. Fix the
> > > > problem by unlocking mapping->i_pages before retrying.
> > >
> > > It seems weird that xfstests doesn't provoke this ...
> >
> > The function currently gets called only from mm/memory-failure.c. And yes,
> > we are lacking DAX hwpoison error tests in fstests...
> 
> I have an item on my backlog to port the ndctl unit test that does
> memory_failure() injection vs ext4 over to fstests. That said I've
> been investigating a deadlock on ext4 caused by this test. When I saw
> this patch I hoped it was root cause, but the test is still failing
> for me. Vishal is able to pass the test on his system, so the failure
> mode is timing dependent. I'm running this patch on top of -rc5 and
> still seeing the following deadlock.

I went through the code but I don't see where the problem could be. How can
I run that test? Is KVM enough or do I need hardware with AEP dimms?

								Honza

> 
>     EXT4-fs (pmem0): DAX enabled. Warning: EXPERIMENTAL, use at your own risk
>     EXT4-fs (pmem0): mounted filesystem with ordered data mode. Opts: dax
>     Injecting memory failure for pfn 0x208900 at process virtual
> address 0x7f5872900000
>     Memory failure: 0x208900: Killing dax-pmd:7095 due to hardware
> memory corruption
>     Memory failure: 0x208900: recovery action for dax page: Recovered
>     watchdog: BUG: soft lockup - CPU#35 stuck for 22s! [dax-pmd:7095]
>     [..]
>     irq event stamp: 121911146
>     hardirqs last  enabled at (121911145): [<ffffffff81aa1bd9>]
> _raw_spin_unlock_irq+0x29/0x40    hardirqs last disabled at
> (121911146): [<ffffffff810037a3>] trace_hardirqs_off_thunk+0x1a/0x1c
>     softirqs last  enabled at (78238674): [<ffffffff81e0032e>]
> __do_softirq+0x32e/0x428
>     softirqs last disabled at (78238627): [<ffffffff810bc6f6>]
> irq_exit+0xf6/0x100
>     CPU: 35 PID: 7095 Comm: dax-pmd Tainted: G           OE
> 4.19.0-rc5+ #2394
>     Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
> rel-1.11.1-0-g0551a4be2c-prebuilt.qemu-project.org 04/01/2014
>     RIP: 0010:lock_release+0x134/0x2a0
>     [..]
>     Call Trace:
>      find_get_entries+0x299/0x3c0
>      pagevec_lookup_entries+0x1a/0x30
>      dax_layout_busy_page+0x9c/0x280
>      ? __lock_acquire+0x12fa/0x1310
>      ext4_break_layouts+0x48/0x100
>      ? ext4_punch_hole+0x108/0x5a0
>      ext4_punch_hole+0x110/0x5a0
>      ext4_fallocate+0x189/0xa40
>      ? rcu_read_lock_sched_held+0x6b/0x80
>      ? rcu_sync_lockdep_assert+0x2e/0x60
>      vfs_fallocate+0x13f/0x270
> 
> The same test against xfs is not failing for me. I have been seeking
> some focus time to dig in on this.

Dan Williams Oct. 5, 2018, 1:57 a.m. UTC | #5

On Thu, Oct 4, 2018 at 9:27 AM Jan Kara <jack@suse.cz> wrote:
>
> On Thu 27-09-18 11:22:22, Dan Williams wrote:
> > On Thu, Sep 27, 2018 at 6:41 AM Jan Kara <jack@suse.cz> wrote:
> > >
> > > On Thu 27-09-18 06:28:43, Matthew Wilcox wrote:
> > > > On Thu, Sep 27, 2018 at 01:23:32PM +0200, Jan Kara wrote:
> > > > > When dax_lock_mapping_entry() has to sleep to obtain entry lock, it will
> > > > > fail to unlock mapping->i_pages spinlock and thus immediately deadlock
> > > > > against itself when retrying to grab the entry lock again. Fix the
> > > > > problem by unlocking mapping->i_pages before retrying.
> > > >
> > > > It seems weird that xfstests doesn't provoke this ...
> > >
> > > The function currently gets called only from mm/memory-failure.c. And yes,
> > > we are lacking DAX hwpoison error tests in fstests...
> >
> > I have an item on my backlog to port the ndctl unit test that does
> > memory_failure() injection vs ext4 over to fstests. That said I've
> > been investigating a deadlock on ext4 caused by this test. When I saw
> > this patch I hoped it was root cause, but the test is still failing
> > for me. Vishal is able to pass the test on his system, so the failure
> > mode is timing dependent. I'm running this patch on top of -rc5 and
> > still seeing the following deadlock.
>
> I went through the code but I don't see where the problem could be. How can
> I run that test? Is KVM enough or do I need hardware with AEP dimms?

KVM is enough... however, I have found a hack that makes the test pass:

diff --git a/mm/filemap.c b/mm/filemap.c
index 52517f28e6f4..d7f035b1846e 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1668,6 +1668,9 @@ unsigned find_get_entries(struct address_space *mapping,
                        goto repeat;
                }
 export:
+               if (iter.index < start)
+                       continue;
+
                indices[ret] = iter.index;
                entries[ret] = page;
                if (++ret == nr_entries)

Is this a radix bug? I would never expect:

    radix_tree_for_each_slot(slot, &mapping->i_pages, &iter, start)

...to return entries with an index < start. Without that change above
we loop forever because dax_layout_busy_page() can't make forward
progress. I'll dig into the radix code tomorrow, but maybe someone
else will beat me to it overnight.

Matthew Wilcox Oct. 5, 2018, 2:52 a.m. UTC | #6

On Thu, Oct 04, 2018 at 06:57:52PM -0700, Dan Williams wrote:
> On Thu, Oct 4, 2018 at 9:27 AM Jan Kara <jack@suse.cz> wrote:
> >
> > On Thu 27-09-18 11:22:22, Dan Williams wrote:
> > > On Thu, Sep 27, 2018 at 6:41 AM Jan Kara <jack@suse.cz> wrote:
> > > >
> > > > On Thu 27-09-18 06:28:43, Matthew Wilcox wrote:
> > > > > On Thu, Sep 27, 2018 at 01:23:32PM +0200, Jan Kara wrote:
> > > > > > When dax_lock_mapping_entry() has to sleep to obtain entry lock, it will
> > > > > > fail to unlock mapping->i_pages spinlock and thus immediately deadlock
> > > > > > against itself when retrying to grab the entry lock again. Fix the
> > > > > > problem by unlocking mapping->i_pages before retrying.
> > > > >
> > > > > It seems weird that xfstests doesn't provoke this ...
> > > >
> > > > The function currently gets called only from mm/memory-failure.c. And yes,
> > > > we are lacking DAX hwpoison error tests in fstests...
> > >
> > > I have an item on my backlog to port the ndctl unit test that does
> > > memory_failure() injection vs ext4 over to fstests. That said I've
> > > been investigating a deadlock on ext4 caused by this test. When I saw
> > > this patch I hoped it was root cause, but the test is still failing
> > > for me. Vishal is able to pass the test on his system, so the failure
> > > mode is timing dependent. I'm running this patch on top of -rc5 and
> > > still seeing the following deadlock.
> >
> > I went through the code but I don't see where the problem could be. How can
> > I run that test? Is KVM enough or do I need hardware with AEP dimms?
> 
> KVM is enough... however, I have found a hack that makes the test pass:
> 
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 52517f28e6f4..d7f035b1846e 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -1668,6 +1668,9 @@ unsigned find_get_entries(struct address_space *mapping,
>                         goto repeat;
>                 }
>  export:
> +               if (iter.index < start)
> +                       continue;
> +
>                 indices[ret] = iter.index;
>                 entries[ret] = page;
>                 if (++ret == nr_entries)
> 
> Is this a radix bug? I would never expect:
> 
>     radix_tree_for_each_slot(slot, &mapping->i_pages, &iter, start)
> 
> ...to return entries with an index < start. Without that change above
> we loop forever because dax_layout_busy_page() can't make forward
> progress. I'll dig into the radix code tomorrow, but maybe someone
> else will be me to it overnight.

If 'start' is within a 2MB entry, iter.index can absolutely be less
than start.  I forget exactly what the radix tree code does, but I think
it returns index set to the canonical/base index of the entry.

Dan Williams Oct. 5, 2018, 4:01 a.m. UTC | #7

On Thu, Oct 4, 2018 at 7:52 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Thu, Oct 04, 2018 at 06:57:52PM -0700, Dan Williams wrote:
> > On Thu, Oct 4, 2018 at 9:27 AM Jan Kara <jack@suse.cz> wrote:
> > >
> > > On Thu 27-09-18 11:22:22, Dan Williams wrote:
> > > > On Thu, Sep 27, 2018 at 6:41 AM Jan Kara <jack@suse.cz> wrote:
> > > > >
> > > > > On Thu 27-09-18 06:28:43, Matthew Wilcox wrote:
> > > > > > On Thu, Sep 27, 2018 at 01:23:32PM +0200, Jan Kara wrote:
> > > > > > > When dax_lock_mapping_entry() has to sleep to obtain entry lock, it will
> > > > > > > fail to unlock mapping->i_pages spinlock and thus immediately deadlock
> > > > > > > against itself when retrying to grab the entry lock again. Fix the
> > > > > > > problem by unlocking mapping->i_pages before retrying.
> > > > > >
> > > > > > It seems weird that xfstests doesn't provoke this ...
> > > > >
> > > > > The function currently gets called only from mm/memory-failure.c. And yes,
> > > > > we are lacking DAX hwpoison error tests in fstests...
> > > >
> > > > I have an item on my backlog to port the ndctl unit test that does
> > > > memory_failure() injection vs ext4 over to fstests. That said I've
> > > > been investigating a deadlock on ext4 caused by this test. When I saw
> > > > this patch I hoped it was root cause, but the test is still failing
> > > > for me. Vishal is able to pass the test on his system, so the failure
> > > > mode is timing dependent. I'm running this patch on top of -rc5 and
> > > > still seeing the following deadlock.
> > >
> > > I went through the code but I don't see where the problem could be. How can
> > > I run that test? Is KVM enough or do I need hardware with AEP dimms?
> >
> > KVM is enough... however, I have found a hack that makes the test pass:
> >
> > diff --git a/mm/filemap.c b/mm/filemap.c
> > index 52517f28e6f4..d7f035b1846e 100644
> > --- a/mm/filemap.c
> > +++ b/mm/filemap.c
> > @@ -1668,6 +1668,9 @@ unsigned find_get_entries(struct address_space *mapping,
> >                         goto repeat;
> >                 }
> >  export:
> > +               if (iter.index < start)
> > +                       continue;
> > +
> >                 indices[ret] = iter.index;
> >                 entries[ret] = page;
> >                 if (++ret == nr_entries)
> >
> > Is this a radix bug? I would never expect:
> >
> >     radix_tree_for_each_slot(slot, &mapping->i_pages, &iter, start)
> >
> > ...to return entries with an index < start. Without that change above
> > we loop forever because dax_layout_busy_page() can't make forward
> > progress. I'll dig into the radix code tomorrow, but maybe someone
> > else will be me to it overnight.
>
> If 'start' is within a 2MB entry, iter.index can absolutely be less
> than start.  I forget exactly what the radix tree code does, but I think
> it returns index set to the canonical/base index of the entry.

Ok, that makes sense. Then the bug is in dax_layout_busy_page() which
needs to increment 'index' by the entry size. This might also explain
why not every run sees it because you may get lucky and have a 4K
entry.

Dan Williams Oct. 5, 2018, 4:28 a.m. UTC | #8

On Thu, Oct 4, 2018 at 9:01 PM Dan Williams <dan.j.williams@intel.com> wrote:
>
> On Thu, Oct 4, 2018 at 7:52 PM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > On Thu, Oct 04, 2018 at 06:57:52PM -0700, Dan Williams wrote:
> > > On Thu, Oct 4, 2018 at 9:27 AM Jan Kara <jack@suse.cz> wrote:
> > > >
> > > > On Thu 27-09-18 11:22:22, Dan Williams wrote:
> > > > > On Thu, Sep 27, 2018 at 6:41 AM Jan Kara <jack@suse.cz> wrote:
> > > > > >
> > > > > > On Thu 27-09-18 06:28:43, Matthew Wilcox wrote:
> > > > > > > On Thu, Sep 27, 2018 at 01:23:32PM +0200, Jan Kara wrote:
> > > > > > > > When dax_lock_mapping_entry() has to sleep to obtain entry lock, it will
> > > > > > > > fail to unlock mapping->i_pages spinlock and thus immediately deadlock
> > > > > > > > against itself when retrying to grab the entry lock again. Fix the
> > > > > > > > problem by unlocking mapping->i_pages before retrying.
> > > > > > >
> > > > > > > It seems weird that xfstests doesn't provoke this ...
> > > > > >
> > > > > > The function currently gets called only from mm/memory-failure.c. And yes,
> > > > > > we are lacking DAX hwpoison error tests in fstests...
> > > > >
> > > > > I have an item on my backlog to port the ndctl unit test that does
> > > > > memory_failure() injection vs ext4 over to fstests. That said I've
> > > > > been investigating a deadlock on ext4 caused by this test. When I saw
> > > > > this patch I hoped it was root cause, but the test is still failing
> > > > > for me. Vishal is able to pass the test on his system, so the failure
> > > > > mode is timing dependent. I'm running this patch on top of -rc5 and
> > > > > still seeing the following deadlock.
> > > >
> > > > I went through the code but I don't see where the problem could be. How can
> > > > I run that test? Is KVM enough or do I need hardware with AEP dimms?
> > >
> > > KVM is enough... however, I have found a hack that makes the test pass:
> > >
> > > diff --git a/mm/filemap.c b/mm/filemap.c
> > > index 52517f28e6f4..d7f035b1846e 100644
> > > --- a/mm/filemap.c
> > > +++ b/mm/filemap.c
> > > @@ -1668,6 +1668,9 @@ unsigned find_get_entries(struct address_space *mapping,
> > >                         goto repeat;
> > >                 }
> > >  export:
> > > +               if (iter.index < start)
> > > +                       continue;
> > > +
> > >                 indices[ret] = iter.index;
> > >                 entries[ret] = page;
> > >                 if (++ret == nr_entries)
> > >
> > > Is this a radix bug? I would never expect:
> > >
> > >     radix_tree_for_each_slot(slot, &mapping->i_pages, &iter, start)
> > >
> > > ...to return entries with an index < start. Without that change above
> > > we loop forever because dax_layout_busy_page() can't make forward
> > > progress. I'll dig into the radix code tomorrow, but maybe someone
> > > else will be me to it overnight.
> >
> > If 'start' is within a 2MB entry, iter.index can absolutely be less
> > than start.  I forget exactly what the radix tree code does, but I think
> > it returns index set to the canonical/base index of the entry.
>
> Ok, that makes sense. Then the bug is in dax_layout_busy_page() which
> needs to increment 'index' by the entry size. This might also explain
> why not every run sees it because you may get lucky and have a 4K
> entry.

Hmm, no 2MB entry here.

We go through the first find_get_entries and export:

    export start: 0x0 index: 0x0 page: 0x822000a
    export start: 0x0 index: 0x200 page: 0xcc3801a

Then dax_layout_busy_page sets 'start' to 0x201, and find_get_entries returns:

    export start: 0x201 index: 0x200 page: 0xcc3801a

...forevermore.

Jan Kara Oct. 5, 2018, 9:54 a.m. UTC | #9

On Thu 04-10-18 21:28:14, Dan Williams wrote:
> On Thu, Oct 4, 2018 at 9:01 PM Dan Williams <dan.j.williams@intel.com> wrote:
> >
> > On Thu, Oct 4, 2018 at 7:52 PM Matthew Wilcox <willy@infradead.org> wrote:
> > >
> > > On Thu, Oct 04, 2018 at 06:57:52PM -0700, Dan Williams wrote:
> > > > On Thu, Oct 4, 2018 at 9:27 AM Jan Kara <jack@suse.cz> wrote:
> > > > >
> > > > > On Thu 27-09-18 11:22:22, Dan Williams wrote:
> > > > > > On Thu, Sep 27, 2018 at 6:41 AM Jan Kara <jack@suse.cz> wrote:
> > > > > > >
> > > > > > > On Thu 27-09-18 06:28:43, Matthew Wilcox wrote:
> > > > > > > > On Thu, Sep 27, 2018 at 01:23:32PM +0200, Jan Kara wrote:
> > > > > > > > > When dax_lock_mapping_entry() has to sleep to obtain entry lock, it will
> > > > > > > > > fail to unlock mapping->i_pages spinlock and thus immediately deadlock
> > > > > > > > > against itself when retrying to grab the entry lock again. Fix the
> > > > > > > > > problem by unlocking mapping->i_pages before retrying.
> > > > > > > >
> > > > > > > > It seems weird that xfstests doesn't provoke this ...
> > > > > > >
> > > > > > > The function currently gets called only from mm/memory-failure.c. And yes,
> > > > > > > we are lacking DAX hwpoison error tests in fstests...
> > > > > >
> > > > > > I have an item on my backlog to port the ndctl unit test that does
> > > > > > memory_failure() injection vs ext4 over to fstests. That said I've
> > > > > > been investigating a deadlock on ext4 caused by this test. When I saw
> > > > > > this patch I hoped it was root cause, but the test is still failing
> > > > > > for me. Vishal is able to pass the test on his system, so the failure
> > > > > > mode is timing dependent. I'm running this patch on top of -rc5 and
> > > > > > still seeing the following deadlock.
> > > > >
> > > > > I went through the code but I don't see where the problem could be. How can
> > > > > I run that test? Is KVM enough or do I need hardware with AEP dimms?
> > > >
> > > > KVM is enough... however, I have found a hack that makes the test pass:
> > > >
> > > > diff --git a/mm/filemap.c b/mm/filemap.c
> > > > index 52517f28e6f4..d7f035b1846e 100644
> > > > --- a/mm/filemap.c
> > > > +++ b/mm/filemap.c
> > > > @@ -1668,6 +1668,9 @@ unsigned find_get_entries(struct address_space *mapping,
> > > >                         goto repeat;
> > > >                 }
> > > >  export:
> > > > +               if (iter.index < start)
> > > > +                       continue;
> > > > +
> > > >                 indices[ret] = iter.index;
> > > >                 entries[ret] = page;
> > > >                 if (++ret == nr_entries)
> > > >
> > > > Is this a radix bug? I would never expect:
> > > >
> > > >     radix_tree_for_each_slot(slot, &mapping->i_pages, &iter, start)
> > > >
> > > > ...to return entries with an index < start. Without that change above
> > > > we loop forever because dax_layout_busy_page() can't make forward
> > > > progress. I'll dig into the radix code tomorrow, but maybe someone
> > > > else will be me to it overnight.
> > >
> > > If 'start' is within a 2MB entry, iter.index can absolutely be less
> > > than start.  I forget exactly what the radix tree code does, but I think
> > > it returns index set to the canonical/base index of the entry.
> >
> > Ok, that makes sense. Then the bug is in dax_layout_busy_page() which
> > needs to increment 'index' by the entry size. This might also explain
> > why not every run sees it because you may get lucky and have a 4K
> > entry.
> 
> Hmm, no 2MB entry here.
> 
> We go through the first find_get_entries and export:
> 
>     export start: 0x0 index: 0x0 page: 0x822000a
>     export start: 0x0 index: 0x200 page: 0xcc3801a
> 
> Then dax_layout_busy_page sets 'start' to 0x201, and find_get_entries returns:
> 
>     export start: 0x201 index: 0x200 page: 0xcc3801a
> 
> ...forevermore.

Are you sure there's not a 2MB entry starting at index 0x200? Because if
there was, we'd get an infinite loop in dax_layout_busy_page() exactly as
you describe, AFAICT. And it seems to me a lot of other places iterating
over entries are broken in a similar way, as they all assume that adding 1
to the current index guarantees them forward progress. The actual breakage
resulting from this is limited, as only DAX uses multiorder entries and
thus few of these iterators ever get called for a radix tree with
multiorder entries (e.g. tmpfs still inserts every 4k subpage of a THP into
the radix tree, and the iteration functions usually handle tail subpages in
a special way). But this really deserves a larger cleanup.

								Honza
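
The forward-progress problem discussed above can be sketched in userspace.
The model below is an illustration, not the radix tree or fs/dax.c code: it
assumes 4K pages, so a 2MB entry is order 9 and spans 0x200 indices, and it
assumes both entries from the trace (0x0 and 0x200) are 2MB entries.
toy_entry, toy_lookup() and toy_walk() are hypothetical names, and the lookup
returns one entry per call rather than a batch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical multi-order entry: covers [base, base + (1 << order)). */
struct toy_entry {
	unsigned long base;
	unsigned int order;
};

/* Assumed layout: two order-9 (2MB with 4K pages) entries. */
static const struct toy_entry entries[] = {
	{ 0x000, 9 },
	{ 0x200, 9 },
};
#define NR_ENTRIES (sizeof(entries) / sizeof(entries[0]))

/*
 * Find the first entry covering any index >= start. The reported index is
 * the entry's base, which may be lower than start.
 */
static bool toy_lookup(unsigned long start, unsigned long *index,
		       unsigned int *order)
{
	for (unsigned long i = 0; i < NR_ENTRIES; i++) {
		unsigned long end = entries[i].base + (1UL << entries[i].order);

		if (start < end) {
			*index = entries[i].base;
			*order = entries[i].order;
			return true;
		}
	}
	return false;
}

/*
 * Walk all entries, giving up after max_steps successful lookups. Returns
 * the number of lookups that found an entry, or -1 if the walk did not
 * terminate within max_steps.
 */
static int toy_walk(bool advance_by_entry_size, int max_steps)
{
	unsigned long start = 0, index;
	unsigned int order;
	int found = 0;

	while (toy_lookup(start, &index, &order)) {
		if (++found > max_steps)
			return -1;
		/* index + 1 can land inside the same entry; index + size cannot. */
		start = advance_by_entry_size ? index + (1UL << order) : index + 1;
	}
	return found;
}
```

With the +1 step the walk keeps getting the same entry back, while stepping by
the entry size visits each entry once and terminates.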

Dan Williams Oct. 6, 2018, 6:04 p.m. UTC | #10

On Fri, Oct 5, 2018 at 2:56 AM Jan Kara <jack@suse.cz> wrote:
>
> On Thu 04-10-18 21:28:14, Dan Williams wrote:
> > On Thu, Oct 4, 2018 at 9:01 PM Dan Williams <dan.j.williams@intel.com> wrote:
> > >
> > > On Thu, Oct 4, 2018 at 7:52 PM Matthew Wilcox <willy@infradead.org> wrote:
> > > >
> > > > On Thu, Oct 04, 2018 at 06:57:52PM -0700, Dan Williams wrote:
> > > > > On Thu, Oct 4, 2018 at 9:27 AM Jan Kara <jack@suse.cz> wrote:
> > > > > >
> > > > > > On Thu 27-09-18 11:22:22, Dan Williams wrote:
> > > > > > > On Thu, Sep 27, 2018 at 6:41 AM Jan Kara <jack@suse.cz> wrote:
> > > > > > > >
> > > > > > > > On Thu 27-09-18 06:28:43, Matthew Wilcox wrote:
> > > > > > > > > On Thu, Sep 27, 2018 at 01:23:32PM +0200, Jan Kara wrote:
> > > > > > > > > > When dax_lock_mapping_entry() has to sleep to obtain entry lock, it will
> > > > > > > > > > fail to unlock mapping->i_pages spinlock and thus immediately deadlock
> > > > > > > > > > against itself when retrying to grab the entry lock again. Fix the
> > > > > > > > > > problem by unlocking mapping->i_pages before retrying.
> > > > > > > > >
> > > > > > > > > It seems weird that xfstests doesn't provoke this ...
> > > > > > > >
> > > > > > > > The function currently gets called only from mm/memory-failure.c. And yes,
> > > > > > > > we are lacking DAX hwpoison error tests in fstests...
> > > > > > >
> > > > > > > I have an item on my backlog to port the ndctl unit test that does
> > > > > > > memory_failure() injection vs ext4 over to fstests. That said I've
> > > > > > > been investigating a deadlock on ext4 caused by this test. When I saw
> > > > > > > this patch I hoped it was root cause, but the test is still failing
> > > > > > > for me. Vishal is able to pass the test on his system, so the failure
> > > > > > > mode is timing dependent. I'm running this patch on top of -rc5 and
> > > > > > > still seeing the following deadlock.
> > > > > >
> > > > > > I went through the code but I don't see where the problem could be. How can
> > > > > > I run that test? Is KVM enough or do I need hardware with AEP dimms?
> > > > >
> > > > > KVM is enough... however, I have found a hack that makes the test pass:
> > > > >
> > > > > diff --git a/mm/filemap.c b/mm/filemap.c
> > > > > index 52517f28e6f4..d7f035b1846e 100644
> > > > > --- a/mm/filemap.c
> > > > > +++ b/mm/filemap.c
> > > > > @@ -1668,6 +1668,9 @@ unsigned find_get_entries(struct address_space *mapping,
> > > > >                         goto repeat;
> > > > >                 }
> > > > >  export:
> > > > > +               if (iter.index < start)
> > > > > +                       continue;
> > > > > +
> > > > >                 indices[ret] = iter.index;
> > > > >                 entries[ret] = page;
> > > > >                 if (++ret == nr_entries)
> > > > >
> > > > > Is this a radix bug? I would never expect:
> > > > >
> > > > >     radix_tree_for_each_slot(slot, &mapping->i_pages, &iter, start)
> > > > >
> > > > > ...to return entries with an index < start. Without that change above
> > > > > we loop forever because dax_layout_busy_page() can't make forward
> > > > > progress. I'll dig into the radix code tomorrow, but maybe someone
> > > > > else will be me to it overnight.
> > > >
> > > > If 'start' is within a 2MB entry, iter.index can absolutely be less
> > > > than start.  I forget exactly what the radix tree code does, but I think
> > > > it returns index set to the canonical/base index of the entry.
> > >
> > > Ok, that makes sense. Then the bug is in dax_layout_busy_page() which
> > > needs to increment 'index' by the entry size. This might also explain
> > > why not every run sees it because you may get lucky and have a 4K
> > > entry.
> >
> > Hmm, no 2MB entry here.
> >
> > We go through the first find_get_entries and export:
> >
> >     export start: 0x0 index: 0x0 page: 0x822000a
> >     export start: 0x0 index: 0x200 page: 0xcc3801a
> >
> > Then dax_layout_busy_page sets 'start' to 0x201, and find_get_entries returns:
> >
> >     export start: 0x201 index: 0x200 page: 0xcc3801a
> >
> > ...forevermore.
>
> Are you sure there's not 2MB entry starting at index 0x200? Because if
> there was, we'd get infinite loop exactly as you describe in
> dax_layout_busy_page()

My debug code was buggy; these are 2MB entries. I'll send out a fix.

> AFAICT. And it seems to me lot of other places
> iterating over entries are borked in a similar way as they all assume that
> doing +1 to the current index is guaranteeing them forward progress. Now
> actual breakage resulting from this is limited as only DAX uses multiorder
> entries and thus not many of these iterators actually ever get called for
> radix tree with multiorder entries (e.g. tmpfs still inserts every 4k subpage
> of THP into the radix tree and iteration functions usually handles
> tail subpages in a special way). But this would really deserve larger
> cleanup.

Yeah, it's a subtle detail waiting to trip up new multi-order-radix
users. The shift reported in the iterator is 6 in this case.
