Quota-enabled XFS hangs during mount

On Tue, Jan 24, 2017 at 02:17:36PM +0100, Martin Svec wrote:
> Hello,
> 
> Dne 23.1.2017 v 14:44 Brian Foster napsal(a):
> > On Mon, Jan 23, 2017 at 10:44:20AM +0100, Martin Svec wrote:
> >> Hello Dave,
> >>
> >> Any updates on this? It's a bit annoying to workaround the bug by increasing RAM just because of the
> >> initial quotacheck.
> >>
> > Note that Dave is away on a bit of an extended vacation[1]. It looks
> > like he was in the process of fishing through the code to spot any
> > potential problems related to quotacheck+reclaim. I see you've cc'd him
> > directly so we'll see if we get a response wrt to if he got anywhere
> > with that...
> >
> > Skimming back through this thread, it looks like we have an issue where
> > quota check is not quite reliable in the event of reclaim, and you
> > appear to be reproducing this due to a probably unique combination of
> > large inode count and low memory.
> >
> > Is my understanding correct that you've reproduced this on more recent
> > kernels than the original report? 
> 
> Yes, I repeated the tests using 4.9.3 kernel on another VM where we hit this issue.
> 
> Configuration:
> * vSphere 5.5 virtual machine, 2 vCPUs, virtual disks residing on iSCSI VMFS datastore
> * Debian Jessie 64 bit webserver, vanilla kernel 4.9.3
> * 180 GB XFS data disk mounted as /www
> 
> Quotacheck behavior depends on assigned RAM:
> * 2 or less GiB: mount /www leads to a storm of OOM kills including shell, ttys etc., so the system
> becomes unusable.
> * 3 GiB: mount /www task hangs in the same way as I reported in earlier in this thread.
> * 4 or more GiB: mount /www succeeds.
> 

I was able to reproduce the quotacheck OOM situation on latest kernels.
This problem actually looks like a regression as of commit 17c12bcd3
("xfs: when replaying bmap operations, don't let unlinked inodes get
reaped"), but I don't think that patch is the core problem. That patch
pulled up setting MS_ACTIVE on the superblock from after XFS runs
quotacheck to before it (for other reasons), which has a side effect of
causing inodes to be placed onto the lru once they are released. Before
this change, all inodes were immediately marked for reclaim once
released from quotacheck because the superblock had not been set active.

The problem here is first that quotacheck issues a bulkstat and thus
grabs and releases every inode in the fs. The quotacheck occurs at mount
time, which means we still hold the s_umount lock and thus the shrinker
cannot run even though it is registered. Therefore, we basically just
populate the lru until we've consumed too much memory and blow up.

I think the solution here is to preserve the quotacheck behavior prior
to commit 17c12bcd3 via something like the following:

... which allows quotacheck to run as normal in my quick tests. Could
you try this on your more recent kernel tests and see whether you still
reproduce any problems?

Note that it looks like the problem reproduced on very old kernels might
be different and probably won't be addressed by this change since the
regression wasn't introduced until v4.9. If you don't reproduce other
problems on the v4.9 based kernel, you might have to track down what
fixed the original problem manually if you want to remain on the old
kernel.

Brian

> The affected disk is checked using xfs_repair. I keep a VM snapshot to be able to reproduce the bug.
> Below is updated filesystem information and dmesg output:
> 
> ---------
> xfs-test:~# df -i
> Filesystem        Inodes   IUsed     IFree IUse% Mounted on
> /dev/sdd1      165312432 2475753 162836679    2% /www
> 
> ---------
> xfs-test:~# xfs_info /www
> meta-data=/dev/sdd1              isize=256    agcount=73, agsize=655232 blks
>          =                       sectsz=512   attr=2, projid32bit=0
>          =                       crc=0        finobt=0
> data     =                       bsize=4096   blocks=47185664, imaxpct=25
>          =                       sunit=0      swidth=0 blks
> naming   =version 2              bsize=4096   ascii-ci=0 ftype=0
> log      =internal               bsize=4096   blocks=2560, version=2
>          =                       sectsz=512   sunit=0 blks, lazy-count=1
> realtime =none                   extsz=4096   blocks=0, rtextents=0
> 
> ---------
> slabtop, 3 GiB RAM:
> 
>  Active / Total Objects (% used)    : 3447273 / 3452076 (99.9%)
>  Active / Total Slabs (% used)      : 648365 / 648371 (100.0%)
>  Active / Total Caches (% used)     : 70 / 124 (56.5%)
>  Active / Total Size (% used)       : 2592192.04K / 2593485.27K (100.0%)
>  Minimum / Average / Maximum Object : 0.02K / 0.75K / 4096.00K
> 
>   OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
> 2477104 2477101  99%    1.00K 619276        4   2477104K xfs_inode
> 631904 631840  99%    0.03K   5096      124     20384K kmalloc-32
>  74496  74492  99%    0.06K   1164       64      4656K kmalloc-64
>  72373  72367  99%    0.56K  10339        7     41356K radix_tree_node
>  38410  38314  99%    0.38K   3841       10     15364K mnt_cache
>  31360  31334  99%    0.12K    980       32      3920K kmalloc-96
>  27574  27570  99%    0.12K    811       34      3244K kernfs_node_cache
>  19152  18291  95%    0.19K    912       21      3648K dentry
>  17312  17300  99%    0.12K    541       32      2164K kmalloc-node
>  14546  13829  95%    0.57K   2078        7      8312K inode_cache
>  11088  11088 100%    0.19K    528       21      2112K kmalloc-192
>   5432   5269  96%    0.07K     97       56       388K Acpi-Operand
>   3960   3917  98%    0.04K     40       99       160K Acpi-Namespace
>   3624   3571  98%    0.50K    453        8      1812K kmalloc-512
>   3320   3249  97%    0.05K     40       83       160K ftrace_event_field
>   3146   3048  96%    0.18K    143       22       572K vm_area_struct
>   2752   2628  95%    0.06K     43       64       172K anon_vma_chain
>   2640   1991  75%    0.25K    165       16       660K kmalloc-256
>   1748   1703  97%    0.09K     38       46       152K trace_event_file
>   1568   1400  89%    0.07K     28       56       112K anon_vma
>   1086   1035  95%    0.62K    181        6       724K proc_inode_cache
>    935    910  97%    0.67K     85       11       680K shmem_inode_cache
>    786    776  98%    2.00K    393        2      1572K kmalloc-2048
>    780    764  97%    1.00K    195        4       780K kmalloc-1024
>    525    341  64%    0.19K     25       21       100K cred_jar
>    408    396  97%    0.47K     51        8       204K xfs_da_state
>    336    312  92%    0.62K     56        6       224K sock_inode_cache
>    309    300  97%    2.05K    103        3       824K idr_layer_cache
>    256    176  68%    0.12K      8       32        32K pid
>    240      2   0%    0.02K      1      240         4K jbd2_revoke_table_s
>    231    231 100%    4.00K    231        1       924K kmalloc-4096
>    230    222  96%    3.31K    115        2       920K task_struct
>    224    205  91%    1.06K     32        7       256K signal_cache
>    213     26  12%    0.05K      3       71        12K Acpi-Parse
>    213    213 100%    2.06K     71        3       568K sighand_cache
>    189     97  51%    0.06K      3       63        12K fs_cache
>    187     86  45%    0.36K     17       11        68K blkdev_requests
>    163     63  38%    0.02K      1      163         4K numa_policy
> 
> ---------
> dmesg, 3 GiB RAM:
> 
> [  967.642413] INFO: task mount:669 blocked for more than 120 seconds.
> [  967.642456]       Tainted: G            E   4.9.3-znr1+ #24
> [  967.642510] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [  967.642570] mount           D    0   669    652 0x00000000
> [  967.642573]  ffff8800b9b8ac00 0000000000000000 ffffffffa800e540 ffff880036b85200
> [  967.642575]  ffff8800bb618740 ffffc90000f87998 ffffffffa7a2802d ffff8800ba38e000
> [  967.642577]  ffffc90000f87998 00000000c021fd94 0002000000000000 ffff880036b85200
> [  967.642579] Call Trace:
> [  967.642586]  [<ffffffffa7a2802d>] ? __schedule+0x23d/0x6e0
> [  967.642588]  [<ffffffffa7a28506>] schedule+0x36/0x80
> [  967.642590]  [<ffffffffa7a2bbac>] schedule_timeout+0x21c/0x3c0
> [  967.642592]  [<ffffffffa774c3ab>] ? __radix_tree_lookup+0x7b/0xe0
> [  967.642594]  [<ffffffffa7a28fbb>] wait_for_completion+0xfb/0x140
> [  967.642596]  [<ffffffffa74ae1f0>] ? wake_up_q+0x70/0x70
> [  967.642654]  [<ffffffffc0225b32>] xfs_qm_flush_one+0x82/0xc0 [xfs]
> [  967.642684]  [<ffffffffc0225ab0>] ? xfs_qm_dqattach_one+0x120/0x120 [xfs]
> [  967.642712]  [<ffffffffc0225f1c>] xfs_qm_dquot_walk.isra.10+0xec/0x170 [xfs]
> [  967.642744]  [<ffffffffc0227f75>] xfs_qm_quotacheck+0x255/0x310 [xfs]
> [  967.642774]  [<ffffffffc0228114>] xfs_qm_mount_quotas+0xe4/0x170 [xfs]
> [  967.642800]  [<ffffffffc02042bd>] xfs_mountfs+0x62d/0x940 [xfs]
> [  967.642827]  [<ffffffffc0208eca>] xfs_fs_fill_super+0x40a/0x590 [xfs]
> [  967.642829]  [<ffffffffa761aa4a>] mount_bdev+0x17a/0x1b0
> [  967.642864]  [<ffffffffc0208ac0>] ? xfs_test_remount_options.isra.14+0x60/0x60 [xfs]
> [  967.642895]  [<ffffffffc0207b35>] xfs_fs_mount+0x15/0x20 [xfs]
> [  967.642897]  [<ffffffffa761b428>] mount_fs+0x38/0x170
> [  967.642900]  [<ffffffffa76390a4>] vfs_kern_mount+0x64/0x110
> [  967.642901]  [<ffffffffa763b7f5>] do_mount+0x1e5/0xcd0
> [  967.642903]  [<ffffffffa763b3ec>] ? copy_mount_options+0x2c/0x230
> [  967.642904]  [<ffffffffa763c5d4>] SyS_mount+0x94/0xd0
> [  967.642907]  [<ffffffffa7a2d0fb>] entry_SYSCALL_64_fastpath+0x1e/0xad
> 
> > If so and we don't hear back from Dave
> > in a reasonable time, it might be useful to provide a metadump of the fs
> > if possible. That would allow us to restore in a similar low RAM vm
> > configuration, trigger quota check and try to reproduce directly...
> 
> Unfortunately, the output of xfs_metadump apparently contains readable fragments of files! We cannot
> provide you such a dump from production server. Shouldn't metadump obfuscate metadata and ignore all
> filesystem data? Maybe it's a sign of filesystem corruption unrecognized by xfs_repair?
> 
> 
> Thank you,
> Martin
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Quota-enabled XFS hangs during mount

Commit Message

Comments

Patch