[0/4] Remove nrexceptional tracking

Message ID 20200804161755.10100-1-willy@infradead.org

Message

Matthew Wilcox (Oracle) Aug. 4, 2020, 4:17 p.m. UTC
We actually use nrexceptional for very little these days.  It's a constant
source of pain with the THP patches because we don't know how large a
shadow entry is, so either we have to ask the xarray how many indices
it covers, or store that information in the shadow entry (and reduce
the amount of other information in the shadow entry proportionally).
While tracking down the most recent case of "evict tells me I've got
the accounting wrong again", I wondered if it might not be simpler to
just remove it.  So here's a patch set to do just that.  I think each
of these patches is an improvement in isolation, but the combination of
all four is larger than the sum of its parts.

I'm running xfstests on this patchset right now.  If one of the DAX
people could try it out, that'd be fantastic.

Matthew Wilcox (Oracle) (4):
  mm: Introduce and use page_cache_empty
  mm: Stop accounting shadow entries
  dax: Account DAX entries as nrpages
  mm: Remove nrexceptional from inode

 fs/block_dev.c          |  2 +-
 fs/dax.c                |  8 ++++----
 fs/inode.c              |  2 +-
 include/linux/fs.h      |  2 --
 include/linux/pagemap.h |  5 +++++
 mm/filemap.c            | 15 ---------------
 mm/truncate.c           | 19 +++----------------
 mm/workingset.c         |  1 -
 8 files changed, 14 insertions(+), 40 deletions(-)

Comments

Verma, Vishal L Aug. 6, 2020, 7:44 p.m. UTC | #1
On Tue, 2020-08-04 at 17:17 +0100, Matthew Wilcox (Oracle) wrote:
> We actually use nrexceptional for very little these days.  It's a constant
> source of pain with the THP patches because we don't know how large a
> shadow entry is, so either we have to ask the xarray how many indices
> it covers, or store that information in the shadow entry (and reduce
> the amount of other information in the shadow entry proportionally).
> While tracking down the most recent case of "evict tells me I've got
> the accounting wrong again", I wondered if it might not be simpler to
> just remove it.  So here's a patch set to do just that.  I think each
> of these patches is an improvement in isolation, but the combination of
> all four is larger than the sum of its parts.
> 
> I'm running xfstests on this patchset right now.  If one of the DAX
> people could try it out, that'd be fantastic.
> 
> Matthew Wilcox (Oracle) (4):
>   mm: Introduce and use page_cache_empty
>   mm: Stop accounting shadow entries
>   dax: Account DAX entries as nrpages
>   mm: Remove nrexceptional from inode

Hi Matthew,

I applied these on top of 5.8 and ran them through the nvdimm unit test
suite, and saw some test failures. The first failing test signature is:

  + umount test_dax_mnt
  ./dax-ext4.sh: line 62: 15749 Segmentation fault      umount $MNT
  FAIL dax-ext4.sh (exit status: 139)

The line is: https://github.com/pmem/ndctl/blob/master/test/dax.sh#L79
And the failing umount happens right after 'run_test', which calls this:
https://github.com/pmem/ndctl/blob/master/test/dax-pmd.c


> 
>  fs/block_dev.c          |  2 +-
>  fs/dax.c                |  8 ++++----
>  fs/inode.c              |  2 +-
>  include/linux/fs.h      |  2 --
>  include/linux/pagemap.h |  5 +++++
>  mm/filemap.c            | 15 ---------------
>  mm/truncate.c           | 19 +++----------------
>  mm/workingset.c         |  1 -
>  8 files changed, 14 insertions(+), 40 deletions(-)
>
Verma, Vishal L Aug. 6, 2020, 8:16 p.m. UTC | #2
On Thu, 2020-08-06 at 19:44 +0000, Verma, Vishal L wrote:
> > 
> > I'm running xfstests on this patchset right now.  If one of the DAX
> > people could try it out, that'd be fantastic.
> > 
> > Matthew Wilcox (Oracle) (4):
> >   mm: Introduce and use page_cache_empty
> >   mm: Stop accounting shadow entries
> >   dax: Account DAX entries as nrpages
> >   mm: Remove nrexceptional from inode
> 
> Hi Matthew,
> 
> I applied these on top of 5.8 and ran them through the nvdimm unit test
> suite, and saw some test failures. The first failing test signature is:
> 
>   + umount test_dax_mnt
>   ./dax-ext4.sh: line 62: 15749 Segmentation fault      umount $MNT
>   FAIL dax-ext4.sh (exit status: 139)
> 
> The line is: https://github.com/pmem/ndctl/blob/master/test/dax.sh#L79
> And the failing umount happens right after 'run_test', which calls this:
> https://github.com/pmem/ndctl/blob/master/test/dax-pmd.c
> 
> 
And here is the associated kernel splat:

[   89.570142] EXT4-fs (pmem4): DAX enabled. Warning: EXPERIMENTAL, use at your own risk
[   89.573407] EXT4-fs (pmem4): mounted filesystem with ordered data mode. Opts: dax
[   90.450644] Injecting memory failure for pfn 0x148900 at process virtual address 0x7f2ec9b00000
[   90.452421] Memory failure: 0x148900: Sending SIGBUS to dax-pmd:16886 due to hardware memory corruption
[   90.454067] Memory failure: 0x148900: recovery action for dax page: Recovered
[   91.656624] ------------[ cut here ]------------
[   91.657822] kernel BUG at fs/inode.c:530!
[   91.658850] invalid opcode: 0000 [#1] SMP PTI
[   91.659883] CPU: 4 PID: 16891 Comm: umount Tainted: G           O      5.8.0-00004-g6161e2d6ac48 #38
[   91.661861] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
[   91.664920] RIP: 0010:clear_inode+0x78/0x90
[   91.665934] Code: 00 a8 20 74 2b a8 40 75 29 48 8b 93 f0 01 00 00 48 8d 83 f0 01 00 00 48 39 c2 75 18 48 c7 83 e0 00 00 00 60 00 00 00 5b 5d c3 <0f> 0b 0f 0b 0f 0b 0f 0b 0f 0b 0f 0b 66 66 2e 0f 1f 84 00 00 00 00
[   91.669979] RSP: 0018:ffffc9000212fd98 EFLAGS: 00010002
[   91.671151] RAX: 0000000000000004 RBX: ffff88810e33d470 RCX: 8c6318c6318c6319
[   91.672689] RDX: ffff888110b48040 RSI: 0000000000000004 RDI: 0000000000000046
[   91.674208] RBP: ffff88810e33d6b8 R08: 0000000000000000 R09: 0000000000000001
[   91.675760] R10: 0000000000000001 R11: 0000000000000000 R12: ffff88810e33d6b0
[   91.677332] R13: ffff88811014c828 R14: ffff88811014c9c0 R15: ffff88810e3398c0
[   91.678940] FS:  00007f86a7e74c80(0000) GS:ffff888117c00000(0000) knlGS:0000000000000000
[   91.680745] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   91.682001] CR2: 000055bcfb4c3008 CR3: 0000000110b7c000 CR4: 00000000000006e0
[   91.683581] Call Trace:
[   91.684268]  ext4_clear_inode+0x16/0x80
[   91.685298]  ext4_evict_inode+0x5f/0x610
[   91.686245]  evict+0xcf/0x1f0
[   91.687017]  dispose_list+0x48/0x70
[   91.687937]  evict_inodes+0x152/0x190
[   91.688918]  generic_shutdown_super+0x37/0x100
[   91.689880]  kill_block_super+0x21/0x50
[   91.690784]  deactivate_locked_super+0x36/0xa0
[   91.691790]  cleanup_mnt+0x12d/0x190
[   91.692658]  task_work_run+0x5c/0xa0
[   91.694739]  __prepare_exit_to_usermode+0x1b6/0x1c0
[   91.695806]  do_syscall_64+0x5e/0xb0
[   91.696656]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[   91.697798] RIP: 0033:0x7f86a80c694b
[   91.698650] Code: 15 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 90 f3 0f 1e fa 31 f6 e9 05 00 00 00 0f 1f 44 00 00 f3 0f 1e fa b8 a6 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 1d 15 0c 00 f7 d8 64 89 01 48
[   91.702545] RSP: 002b:00007ffd49a8cb98 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
[   91.704204] RAX: 0000000000000000 RBX: 00007f86a81ee204 RCX: 00007f86a80c694b
[   91.705732] RDX: ffffffffffffff78 RSI: 0000000000000000 RDI: 000055bcfb4bf2a0
[   91.707207] RBP: 000055bcfb4bafa0 R08: 0000000000000000 R09: 00007f86a8188a40
[   91.708809] R10: 000055bcfb4c1550 R11: 0000000000000246 R12: 0000000000000000
[   91.710040] R13: 000055bcfb4bf2a0 R14: 000055bcfb4bb0b0 R15: 000055bcfb4bb1d0
[   91.711193] Modules linked in: nd_e820(O) nd_blk(O) nfit(O) kmem nd_pmem(O) nd_btt(O) dax_pmem(O) dax_pmem_core(O) libnvdimm(O) device_dax(O) nfit_test_iomap(O) [last unloaded: nfit_test_iomap]
[   91.714154] ---[ end trace 016cb116e8654993 ]---
[   91.715011] RIP: 0010:clear_inode+0x78/0x90
[   91.715803] Code: 00 a8 20 74 2b a8 40 75 29 48 8b 93 f0 01 00 00 48 8d 83 f0 01 00 00 48 39 c2 75 18 48 c7 83 e0 00 00 00 60 00 00 00 5b 5d c3 <0f> 0b 0f 0b 0f 0b 0f 0b 0f 0b 0f 0b 66 66 2e 0f 1f 84 00 00 00 00
[   91.718950] RSP: 0018:ffffc9000212fd98 EFLAGS: 00010002
[   91.719976] RAX: 0000000000000004 RBX: ffff88810e33d470 RCX: 8c6318c6318c6319
[   91.721187] RDX: ffff888110b48040 RSI: 0000000000000004 RDI: 0000000000000046
[   91.722473] RBP: ffff88810e33d6b8 R08: 0000000000000000 R09: 0000000000000001
[   91.723633] R10: 0000000000000001 R11: 0000000000000000 R12: ffff88810e33d6b0
[   91.724838] R13: ffff88811014c828 R14: ffff88811014c9c0 R15: ffff88810e3398c0
[   91.726109] FS:  00007f86a7e74c80(0000) GS:ffff888117c00000(0000) knlGS:0000000000000000
[   91.727575] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   91.728596] CR2: 000055bcfb4c3008 CR3: 0000000110b7c000 CR4: 00000000000006e0
[   91.729794] note: umount[16891] exited with preempt_count 1
[   91.730746] BUG: sleeping function called from invalid context at include/linux/percpu-rwsem.h:49
[   91.732304] in_atomic(): 0, irqs_disabled(): 1, non_block: 0, pid: 16891, name: umount
[   91.733764] INFO: lockdep is turned off.
[   91.734476] irq event stamp: 6192
[   91.735067] hardirqs last  enabled at (6191): [<ffffffff81c7a974>] _raw_spin_unlock_irq+0x24/0x40
[   91.736641] hardirqs last disabled at (6192): [<ffffffff81c7ab27>] _raw_spin_lock_irq+0x17/0x80
[   91.738141] softirqs last  enabled at (6120): [<ffffffff8134b3b8>] bdi_split_work_to_wbs+0x248/0x580
[   91.739752] softirqs last disabled at (6116): [<ffffffff81347a5d>] wb_queue_work+0x4d/0x180
[   91.741297] CPU: 4 PID: 16891 Comm: umount Tainted: G      D    O      5.8.0-00004-g6161e2d6ac48 #38
[   91.742940] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
[   91.744887] Call Trace:
[   91.745401]  dump_stack+0x92/0xc8
[   91.745987]  ___might_sleep.cold+0xb6/0xc6
[   91.746716]  exit_signals+0x1c/0x2d0
[   91.747355]  do_exit+0xc0/0xba0
[   91.747942]  ? __prepare_exit_to_usermode+0x1b6/0x1c0
[   91.748821]  rewind_stack_do_exit+0x17/0x20
[   91.749619] RIP: 0033:0x7f86a80c694b
[   91.750330] Code: 15 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 90 f3 0f 1e fa 31 f6 e9 05 00 00 00 0f 1f 44 00 00 f3 0f 1e fa b8 a6 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 1d 15 0c 00 f7 d8 64 89 01 48
[   91.753544] RSP: 002b:00007ffd49a8cb98 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
[   91.754902] RAX: 0000000000000000 RBX: 00007f86a81ee204 RCX: 00007f86a80c694b
[   91.756176] RDX: ffffffffffffff78 RSI: 0000000000000000 RDI: 000055bcfb4bf2a0
[   91.757746] RBP: 000055bcfb4bafa0 R08: 0000000000000000 R09: 00007f86a8188a40
[   91.759034] R10: 000055bcfb4c1550 R11: 0000000000000246 R12: 0000000000000000
[   91.760166] R13: 000055bcfb4bf2a0 R14: 000055bcfb4bb0b0 R15: 000055bcfb4bb1d0
Matthew Wilcox (Oracle) Oct. 8, 2020, 7:33 p.m. UTC | #3
On Thu, Aug 06, 2020 at 08:16:02PM +0000, Verma, Vishal L wrote:
> On Thu, 2020-08-06 at 19:44 +0000, Verma, Vishal L wrote:
> > > 
> > > I'm running xfstests on this patchset right now.  If one of the DAX
> > > people could try it out, that'd be fantastic.
> > > 
> > > Matthew Wilcox (Oracle) (4):
> > >   mm: Introduce and use page_cache_empty
> > >   mm: Stop accounting shadow entries
> > >   dax: Account DAX entries as nrpages
> > >   mm: Remove nrexceptional from inode
> > 
> > Hi Matthew,
> > 
> > I applied these on top of 5.8 and ran them through the nvdimm unit test
> > suite, and saw some test failures. The first failing test signature is:
> > 
> >   + umount test_dax_mnt
> >   ./dax-ext4.sh: line 62: 15749 Segmentation fault      umount $MNT
> >   FAIL dax-ext4.sh (exit status: 139)

Thanks.  Fixed:

+++ b/fs/dax.c
@@ -644,7 +644,7 @@ static int __dax_invalidate_entry(struct address_space *mapping,
                goto out;
        dax_disassociate_entry(entry, mapping, trunc);
        xas_store(&xas, NULL);
-       mapping->nrpages -= dax_entry_order(entry);
+       mapping->nrpages -= 1UL << dax_entry_order(entry);
        ret = 1;
 out:
        put_unlocked_entry(&xas, entry);

Updated git tree at
https://git.infradead.org/users/willy/pagecache.git/

It survives an xfstests run on an fsdax namespace which supports 2MB pages.
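
To illustrate why the one-line change matters, here is a minimal userspace sketch of the accounting, not the kernel code itself. It assumes the insert side adds 1UL << order pages per entry, and that a 2MB entry is order 9 (4KB base pages). The names toy_mapping, entry_pages, toy_insert, toy_remove_buggy and toy_remove_fixed are hypothetical and exist only for this example.

```c
/* Hypothetical stand-in for the nrpages counter in struct address_space. */
struct toy_mapping {
	unsigned long nrpages;
};

/* An entry of order n covers 2^n page indices. */
unsigned long entry_pages(unsigned int order)
{
	return 1UL << order;
}

/* Assumed insert-side accounting: count every page the entry covers. */
void toy_insert(struct toy_mapping *m, unsigned int order)
{
	m->nrpages += entry_pages(order);
}

/* Mirrors the removed line: subtracts the order, not the page count. */
void toy_remove_buggy(struct toy_mapping *m, unsigned int order)
{
	m->nrpages -= order;
}

/* Mirrors the added line: subtracts the same amount that was added. */
void toy_remove_fixed(struct toy_mapping *m, unsigned int order)
{
	m->nrpages -= entry_pages(order);
}
```

Under these assumptions, inserting and then removing one order-9 entry leaves nrpages at 503 with the old line and at 0 with the new one. A nonzero count left behind at eviction time would be consistent with the BUG in clear_inode() reported above.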
Verma, Vishal L Oct. 9, 2020, 11:04 p.m. UTC | #4
On Thu, 2020-10-08 at 20:33 +0100, Matthew Wilcox wrote:
> On Thu, Aug 06, 2020 at 08:16:02PM +0000, Verma, Vishal L wrote:
> > On Thu, 2020-08-06 at 19:44 +0000, Verma, Vishal L wrote:
> > > > I'm running xfstests on this patchset right now.  If one of the DAX
> > > > people could try it out, that'd be fantastic.
> > > > 
> > > > Matthew Wilcox (Oracle) (4):
> > > >   mm: Introduce and use page_cache_empty
> > > >   mm: Stop accounting shadow entries
> > > >   dax: Account DAX entries as nrpages
> > > >   mm: Remove nrexceptional from inode
> > > 
> > > Hi Matthew,
> > > 
> > > I applied these on top of 5.8 and ran them through the nvdimm unit test
> > > suite, and saw some test failures. The first failing test signature is:
> > > 
> > >   + umount test_dax_mnt
> > >   ./dax-ext4.sh: line 62: 15749 Segmentation fault      umount $MNT
> > >   FAIL dax-ext4.sh (exit status: 139)
> 
> Thanks.  Fixed:
> 
> +++ b/fs/dax.c
> @@ -644,7 +644,7 @@ static int __dax_invalidate_entry(struct address_space *mapping,
>                 goto out;
>         dax_disassociate_entry(entry, mapping, trunc);
>         xas_store(&xas, NULL);
> -       mapping->nrpages -= dax_entry_order(entry);
> +       mapping->nrpages -= 1UL << dax_entry_order(entry);
>         ret = 1;
>  out:
>         put_unlocked_entry(&xas, entry);
> 
> Updated git tree at
> https://git.infradead.org/users/willy/pagecache.git/

I ran this tree through the unit tests, and everything passes.
(Well, while the tests passed, this tree as-is did have an RCU warning
splat. I rebased to v5.9-rc8 and that was fine).

Feel free to add:

Tested-by: Vishal Verma <vishal.l.verma@intel.com>